Decoding Silent Speech in Japanese from Single Trial EEGs: Preliminary Results

Copyright: © 2015 Yamaguchi H, et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.


Introduction
The decipherment of human thought from brain activity, without recourse to speech or action, is one of the most attractive and challenging frontiers of modern science. In particular, silent speech recognition systems (SSRSs) enable speech communication when an audible acoustic signal is unavailable [1]. In addition to "physical" SSRSs [2][3][4][5], there are "electrical" ones, in which articulation may be inferred from actuator muscle signals or predicted using command signals obtained directly from the brain. The latter, in particular, could serve as a speech prosthesis for individuals with severe communication impairments. Electrocorticographic (ECoG) recordings made during speech production attempts have increasingly enabled the decoding of phonemes and words [6,7] and artificial speech synthesis [8,9]. However, SSRSs using non-invasively recorded brain activity, such as scalp-recorded EEGs [10][11][12][13][14], functional magnetic resonance imaging (fMRI) [15] and functional near-infrared spectroscopy (fNIRS) [16], are still in the experimental stage and limited almost entirely to vowel recognition. Therefore, we propose a new scheme for a speaker-dependent SSRS using single-trial scalp-recorded EEGs for silent vowel recognition, and generalize it to consonant recognition in Japanese. The scheme consists of two phases, a learning phase and a decoding phase. To exemplify this scheme, we carried out two experiments (Experiments I and II).

Subjects, tasks and electrical recordings
Ten healthy student volunteers (two females and eight males; mean age: 23.7 ± 1.42 years) participated in Experiment I, whose procedures were approved by the Ethics Committee for Human Subject Research, Faculty of Computer Science and Systems Engineering, Kyushu Institute of Technology. Written informed consent was obtained from all the students prior to the experiment. All the subjects were right-handed according to the Edinburgh inventory [17]. The subjects were requested to speak "rock", "paper" or "scissors" (/gu:/, /pa:/ or /tʃɔki/ in English pronunciation of Japanese, respectively) into a microphone (MS-STM87SV, ELECOM CO., LTD., Japan) in the learning phase, or to speak it silently in the decoding phase, according to visual cues. After the subjects had gazed for 3 s at a fixation point presented at the center of a monitor 62 cm away, a line drawing of a hand indicating "rock", "paper" or "scissors" was presented for the next 3 s, followed by the fixation point alone for a further 3 s. Then, when the point disappeared, the subjects overtly or covertly spoke "rock", "paper" or "scissors", corresponding to the line drawing presented just before (Figure 1). The line drawings were presented in random order, ten times for each janken (rock-paper-scissors) gesture. Nineteen active electrodes (AP-C100-0155, DIGITEX LAB. CO., LTD., Japan) were affixed to the scalp according to the International 10-20 system. Six additional channels were included for electromyograms (EMGs) and electrooculograms (EOGs).

Abstract
We propose a new scheme for speaker-dependent silent speech recognition systems (SSRSs) using both single-trial scalp-recorded electroencephalograms (EEGs) and speech signals measured during overt and covert speaking of "janken" and "season" words in Japanese. This scheme consists of two phases. The learning phase specifies a Kalman filter using spectrograms of the speech signals and independent components (ICs) of the EEGs during the actual speech, whose equivalent current dipole source localization (ECDL) solutions were located mainly at Broca's area. In the "season" task, the speech signals were transformed into vowel and consonant sequences, and these relationships were learned by hidden Markov models (HMMs) with Gaussian mixture densities. The decoding phase predicts spectrograms for the silent "janken" and "season" using the Kalman filter with the EEGs during the silent speech. For the silent "season", the predicted spectrograms were input to the HMMs, and which "season" was silently spoken was determined by the maximal log-likelihood among the HMMs. Our preliminary results, obtained at the training stage, are as follows: the silent "jankens" were correctly discriminated, and the silent "season"-HMMs worked well, suggesting that this scheme might be applied to the discrimination between all the pairs of hiragana.

Independent EEG sources obtained by ICA are dipolar [24]. ECDL was applied to the reconstructed EEGs, namely the projection of each of the remaining ICs onto the scalp surface by the deflation procedure, using "SynaCenterPro" (PC-based commercial software for multiple ECDL; NEC Corporation). This software estimates unconstrained dipoles at any timepoint [25], using the three-layered concentric sphere head model and nonlinear optimization methods [26]. An unconstrained dipole was estimated at any timepoint with a maximal peak or trough in the EEGs reconstructed by the deflation procedure for each IC.
Here, we searched for appropriate and reliable dipole solutions by (i) selecting only localization results with a goodness of fit (GOF) of more than 90% and simplified confidence limits (CLs) of less than 1 mm, (ii) restricting them to results with no drastic change in the brain sites where the unconstrained dipoles were located for at least twenty successive instants including the peak or trough, and (iii) excluding ECDL results localized to the cerebral ventricles or the corpus callosum.
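The selection criteria above can be expressed as a simple filter. The following is a minimal Python sketch, not the procedure implemented in SynaCenterPro: the `DipoleFit` record, its field names and the `is_reliable` helper are hypothetical, "no drastic change in the brain sites" is approximated by an unchanged anatomical label, and the GOF and CL thresholds are assumed to apply at the peak or trough instant only.

```python
from dataclasses import dataclass

# Thresholds taken from the text; names and site labels are illustrative.
GOF_MIN = 90.0            # goodness of fit, percent
CL_MAX_MM = 1.0           # simplified confidence limit, mm
MIN_STABLE_INSTANTS = 20  # successive instants at the same brain site
EXCLUDED_SITES = {"cerebral ventricle", "corpus callosum"}


@dataclass
class DipoleFit:
    gof: float    # goodness of fit in percent
    cl_mm: float  # simplified confidence limit in mm
    site: str     # anatomical label of the fitted dipole (hypothetical field)


def is_reliable(fits, peak_index):
    """Return True if the dipole at peak_index passes all selection criteria.

    fits holds one DipoleFit per successive time instant.
    """
    peak = fits[peak_index]
    if peak.gof <= GOF_MIN or peak.cl_mm >= CL_MAX_MM:
        return False
    if peak.site in EXCLUDED_SITES:
        return False
    # Grow the run of successive instants around the peak sharing its site.
    start = peak_index
    while start > 0 and fits[start - 1].site == peak.site:
        start -= 1
    end = peak_index
    while end < len(fits) - 1 and fits[end + 1].site == peak.site:
        end += 1
    return (end - start + 1) >= MIN_STABLE_INSTANTS
```

A stricter variant could also require the GOF and CL thresholds at every instant of the stable run; the text does not specify which reading was used.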
Anatomical labeling of the brain sites where the ECDs were located was carried out automatically, using the Japanese brain atlas for a single subject, as follows: each subject's MRI was transformed into the atlas, the estimated ECDs were then projected onto the atlas by this nonlinear transformation, and finally the anatomical labels were read from the atlas [27].

Face, mouth and eye movements were monitored with the EMG and EOG channels. The EEGs recorded at each electrode were fed to an amplifier (Polymate AP1132, DIGITEX LAB. CO., LTD., Japan) with a gain of 10000 and a 60 Hz notch filter. The amplified EEGs were sampled at a rate of 1 kHz during an epoch of 3 s preceding and 3 s following each stimulus presentation. The online A/D-converted EEG data were immediately stored on a hard disk in a personal computer (Figure 2). Note that the speech signals collected by the microphone were digitized and, if necessary, downsampled with Audacity (free software for recording and editing sounds: http://audacity.sourceforge.net/), and transformed into spectrograms with WaveSurfer (free audio and video software: http://www.speech.kth.se/wavesurfer/).

Six healthy student volunteers aged 23 to 27 (one female) participated in Experiment II, in which a landscape photograph associated with "spring", "summer", "autumn" or "winter" was presented. The task paradigm and the time scheduling of speech signals, EEGs and EMGs were the same as in Experiment I, except that 13-channel EEG recordings were used (F3, F5, F4, F6, F7, F8, FC3, FCz, FC4, C3, Cz, C4, POz).

Grand averages
The grand average for the actual janken was obtained by averaging the EEGs time-locked to the EMG onsets of the speech signals, and that for the silent task by averaging the EEGs time-locked to the average EMG onset of the signals for each task and subject. About 900 epochs were used for these grand averages (10 subjects × 3 sessions × 3 janken × 10 trials), because three sessions were carried out, each of which included 10 trials for each janken. In the silent task, EEGs were excluded from the averaging if the subject overtly spoke by mistake.
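As an illustration of the time-locked averaging described above, the sketch below averages epochs aligned to a per-trial onset sample. It is a minimal Python example under stated assumptions: the function names, the plain-list signal representation and the epoch window arguments are illustrative and are not taken from the original analysis. For the silent task, the onset passed in would be the average EMG onset of the corresponding task and subject.

```python
def extract_epoch(signal, onset, pre, post):
    """Return samples from onset - pre up to (not including) onset + post.

    Returns None when the window does not fit inside the recording.
    """
    start, stop = onset - pre, onset + post
    if start < 0 or stop > len(signal):
        return None
    return signal[start:stop]


def grand_average(signals, onsets, pre, post):
    """Average single-channel epochs time-locked to the given onset samples."""
    epochs = []
    for signal, onset in zip(signals, onsets):
        epoch = extract_epoch(signal, onset, pre, post)
        if epoch is not None:
            epochs.append(epoch)
    if not epochs:
        raise ValueError("no valid epochs to average")
    n = len(epochs)
    # zip(*epochs) groups the samples that share the same latency.
    return [sum(samples) / n for samples in zip(*epochs)]


# Epoch count implied by the design stated in the text.
N_SUBJECTS, N_SESSIONS, N_GESTURES, N_TRIALS = 10, 3, 3, 10
TOTAL_EPOCHS = N_SUBJECTS * N_SESSIONS * N_GESTURES * N_TRIALS
```

Trials in which the subject spoke overtly by mistake would simply be left out of `signals` before averaging.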

ICA and ECDL
In the learning phase (Figure 3), independent component analysis (ICA) was applied to the single-trial EEGs obtained. ICALAB (http://www.bsp.brain.riken.jp/ICALAB/ICALABSignalProc/) was used to apply the fast fixed-point ICA algorithm to the 19-ch EEGs, together with a MATLAB toolbox [18]. Then, independent components (ICs) were extracted whose equivalent current dipole source localization (ECDL) solutions were localized to the primary motor and premotor cortices, supplementary motor area (SMA) and/or Broca's area (BA) (Figure 4), with reference to previous neuroimaging studies of overt articulation related to speech production [19][20][21][22][23]. Fewer electrodes were used in Experiment II, in a configuration selected so that it would be easier to obtain BA-ICs, on the basis of a recent finding [28].

Kalman filter
Next, the relationship between the extracted ICs and the spectrograms of the speech signals was described by a Kalman filter. This follows the hypothesis, assumed in the Directions Into Velocities of Articulators (DIVA) model [8,29], that neurons in the left ventral premotor cortex represent intended speech sounds in terms of formant frequency trajectories, and that projections from these neurons to the primary motor cortex transform the intended formant trajectories into motor commands to the speech articulators. The filter was given by

$$x_{t+1} = A x_t + w_t, \qquad y_t = C x_t + v_t,$$

where A, C, w and v are the parameters to be estimated [30]: $x_t$ is the two-dimensional vector consisting of the first (F1) and second (F2) formant frequencies, $y_t$ is the one-dimensional vector representing one IC, the matrix A describes the relationship between past and future formant frequencies, C describes the expectation of the reconstructed EEGs given a set of formant frequencies, and the error terms $w_t$ and $v_t$ are white Gaussian random variables.
In the decoding phase (Figure 5), the inputs to the Kalman filter specified in the learning phase were the ICs whose dipole solutions were located at the premotor cortex, SMA and/or Broca's area, in accordance with previous neuroimaging studies related to silent speech [19][31][32][33][34], and the filter estimated spectrograms for the silent speech using the so-called Kalman filter algorithm [35]. Time 0 ms on the EEGs was defined as the average EMG onset in the learning phase for each subject (Figure 5).
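The decoding step can be illustrated with the standard Kalman filter recursion for a two-dimensional state (F1, F2) and a one-dimensional observation (one IC). This is a minimal sketch, assuming the usual linear-Gaussian form x[t+1] = A x[t] + w, y[t] = C x[t] + v with noise covariances W and q; the function name, the argument layout and the pure-Python matrix handling are illustrative choices rather than the implementation used in this study.

```python
def kalman_filter(ys, A, C, W, q, x0, P0):
    """Estimate 2-D states from scalar observations with a Kalman filter.

    Assumed model: x[t+1] = A x[t] + w,  y[t] = C x[t] + v,
    with w ~ N(0, W) and v ~ N(0, q).
    A, W and P0 are 2x2 nested lists; C and x0 are length-2 lists;
    q is a scalar. Returns one filtered state estimate per observation.
    """
    x = list(x0)
    P = [row[:] for row in P0]
    estimates = []
    for y in ys:
        # Prediction: x_pred = A x, P_pred = A P A^T + W
        x_pred = [A[i][0] * x[0] + A[i][1] * x[1] for i in range(2)]
        AP = [[sum(A[i][k] * P[k][j] for k in range(2)) for j in range(2)]
              for i in range(2)]
        P_pred = [[sum(AP[i][k] * A[j][k] for k in range(2)) + W[i][j]
                   for j in range(2)] for i in range(2)]
        # Update: the innovation variance s is scalar because y is 1-D
        PCt = [P_pred[i][0] * C[0] + P_pred[i][1] * C[1] for i in range(2)]
        s = C[0] * PCt[0] + C[1] * PCt[1] + q
        K = [PCt[i] / s for i in range(2)]
        innovation = y - (C[0] * x_pred[0] + C[1] * x_pred[1])
        x = [x_pred[i] + K[i] * innovation for i in range(2)]
        # Covariance update: P = (I - K C) P_pred
        CP = [C[0] * P_pred[0][j] + C[1] * P_pred[1][j] for j in range(2)]
        P = [[P_pred[i][j] - K[i] * CP[j] for j in range(2)] for i in range(2)]
        estimates.append(x)
    return estimates
```

In this sketch `P0` is the initial covariance matrix of the state, and `ys` would be the samples of the selected IC recorded during silent speech.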

HMM construction
The above SSRS for silent janken was constructed in terms of vowel recognition. Therefore, for example, spring ("haru" in English pronunciation of Japanese) and summer ("natsu") could not be discriminated by the SSRS, because their vowel transitions (/a/ to /u/) are the same. To cope with this problem, Experiment II was designed. In the learning phase for Experiment II, the speech signals were transformed into vowel and consonant sequences, and these transitions, in addition to the spectrograms, were learned by hidden Markov models (HMMs). In the decoding phase, the inputs to the HMMs are the spectrograms estimated by the Kalman filter specified in the learning phase. Which season was silently spoken was determined by the maximal likelihood among the HMM outputs.
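The maximum-likelihood decision among the season-HMMs can be sketched with the forward algorithm in the log domain. The example below is a simplified illustration, not the HTK procedure: it assumes one-dimensional observations and a single Gaussian per state instead of Gaussian mixture densities over spectrograms, sums over all states at the final frame rather than over a set of final states, and uses hypothetical helper names (`left_to_right`, `log_likelihood`, `classify`).

```python
import math

NEG_INF = float("-inf")


def log_gauss(x, mean, var):
    """Log density of a one-dimensional Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(v - m) for v in values))


def left_to_right(means, variances, stay=0.5):
    """Build a left-to-right HMM: each state either stays or moves one step."""
    n = len(means)
    log_pi = [0.0] + [NEG_INF] * (n - 1)  # always start in the first state
    log_a = [[NEG_INF] * n for _ in range(n)]
    for i in range(n):
        if i + 1 < n:
            log_a[i][i] = math.log(stay)
            log_a[i][i + 1] = math.log(1 - stay)
        else:
            log_a[i][i] = 0.0  # last state is absorbing
    return log_pi, log_a, means, variances


def log_likelihood(obs, log_pi, log_a, means, variances):
    """Forward algorithm in the log domain: log P(obs | model)."""
    n = len(log_pi)
    alpha = [log_pi[j] + log_gauss(obs[0], means[j], variances[j])
             for j in range(n)]
    for x in obs[1:]:
        alpha = [logsumexp([alpha[i] + log_a[i][j] for i in range(n)])
                 + log_gauss(x, means[j], variances[j]) for j in range(n)]
    return logsumexp(alpha)


def classify(obs, models):
    """Return the label of the model with the maximal log-likelihood."""
    return max(models, key=lambda label: log_likelihood(obs, *models[label]))
```

Here `models` would map each season label to the parameters of its trained HMM, and `obs` would be the feature sequence derived from the predicted spectrogram.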
For example, the present "spring"-HMM is the left-to-right one shown in Figure 6, with $\pi = (\pi_1, \ldots, \pi_N)$ the initial state probability vector. Figure 6 exemplifies N=5 and M=4. In the learning phase, each vowel and consonant occurrence is segmented into N states. This segmentation is achieved by finding the optimum state sequence via the Viterbi algorithm, with the output probabilities $b_j(x_t)$ modeled by Gaussian mixture densities, in addition to the initialization of $\pi_i$ and $a_{ij}$. The parameters of the densities are estimated from the spectrograms of ten actual speech trials after K-means clustering [36]. That is, the optimum state sequence over the states i = 1, 2, …, N is obtained [36], where F is a set of final states. Thus, the HMM parameters were initialized by the Viterbi algorithm and then re-estimated by the Baum-Welch algorithm. These procedures were carried out with HTK (a portable toolkit for building and manipulating HMMs in C: http://htk.eng.cam.ac.uk/). In the decoding phase, the silently spoken season was taken to be the one whose season-HMM gave the maximal likelihood for the spectrogram predicted by the Kalman filter.

Figure 7 shows the grand averages for the actual and silent janken tasks. It reveals a similar Bereitschaftspotential (BP)-like component for both the actual and the silent speech, whereas motor potential (MP)-like components [37] appear only for the actual speech. Therefore, for both tasks, we should pay attention to the BP-like components [38].

Estimated spectrograms for the silent janken
As the training performance for one subject, the diagonal panels of Figure 8 show the predicted spectrograms in the F1-F2 plane for the silent "rock", "paper" and "scissors", together with the ellipsoidal distributions of the five Japanese vowels [39]. For "rock" (/gu:/) and "paper" (/pa:/), the predictions were considered correct when the formant frequency trajectories reached the /u/ and /a/ regions, respectively, while for "scissors" (/tʃɔki/) the trajectory was regarded as correct if it passed through the /o (ɔ)/ region and then the /i/ region. In terms of Japanese pronunciation, a main difference between "scissors" and the others is that the former has two different vowels and the latter have one. To incorporate this difference into the Kalman filter algorithm, the initial values of the covariance matrix [35] were set to the variances of F1 and F2 and their covariance. Figure 9 plots all the spectrograms for each janken in the F1-F2 plane, including the covariance matrices. Note that the covariance for "scissors" was much larger than those for "rock" and "paper". The diagonal panels of Figure 8 show the outputs of the Kalman filter algorithm with the initial values (V) depicted in Figure 9, and all indicate correct predictions. The same tendencies as in Figure 8 were obtained for all the remaining subjects. The remaining panels of Figure 8 exemplify the misapplication of our predictors. For example, the "rock" predictor gave a correct estimate only for the silent "rock" EEGs.

Table 2 shows a confusion matrix for the silent "season" tasks in terms of HMM log-likelihood values. For example, in the first row (silent "spring" EEG), for the estimated silent "spring" spectrograms, the log-likelihood of the "spring"-HMM was higher than those of the other HMMs. The other rows demonstrate the same tendency. Therefore, if the highest log-likelihood value is accepted, these HMMs can be said to work well.
As a preliminary result, the accuracy was 86% ("spring"), 29% ("summer"), 43% ("autumn") and 100% ("winter") for one subject.
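The correctness criterion used for the silent janken trajectories (reaching the /u/ or /a/ region, or passing through /o/ and then /i/) can be sketched as an ordered region test in the F1-F2 plane. In this minimal Python sketch each vowel region is assumed to be an ellipse given by a mean and a 2×2 covariance matrix, with membership decided by a Mahalanobis radius of 2; the radius, the function names and the data layout are assumptions, and the actual vowel distributions are those of [39].

```python
def in_ellipse(point, mean, cov, radius=2.0):
    """True if point lies within `radius` Mahalanobis units of the mean.

    point and mean are (F1, F2) pairs; cov is a 2x2 covariance matrix.
    """
    d1, d2 = point[0] - mean[0], point[1] - mean[1]
    (a, b), (c, d) = cov
    det = a * d - b * c
    # Squared Mahalanobis distance using the explicit 2x2 inverse.
    m2 = (d * d1 * d1 - (b + c) * d1 * d2 + a * d2 * d2) / det
    return m2 <= radius ** 2


def visits_in_order(trajectory, regions, required):
    """True if the trajectory enters every required vowel region in order.

    regions maps a vowel label to (mean, cov); required is the ordered
    list of vowel labels, e.g. one label for a single-vowel word.
    """
    idx = 0
    for point in trajectory:
        if idx < len(required):
            mean, cov = regions[required[idx]]
            if in_ellipse(point, mean, cov):
                idx += 1
    return idx == len(required)
```

A single-vowel word is checked with a one-element `required` list, and a two-vowel word with the two labels in spoken order.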

Discussion and Future Outlook
In order to decode silent speech from single-trial EEGs, we used Kalman filters for vowel recognition, and HMMs for continuous speech recognition including consonants. The performance of the present Kalman filters might be improved in the following ways.

Three-dimensional Kalman filter
By constructing a three-dimensional Kalman filter, that is, one that also involves the third formant frequency (F3), we obtained more discriminative results for the silent "rock" and "paper" tasks (Figure 10).

Future research
Since Japanese has a syllable-timed rhythm [42], the present method could be easily generalized to all the pairs of hiragana. However, because the present results are limited to the training performance,

Kalman filter using two ICs
Intrinsically, vowels and consonants are known to be processed by distinct neural mechanisms [40]. For example, vowels and consonants increased activation in the right middle temporal and frontal areas, respectively [41]. Tentatively, in the learning phase, we constructed a Kalman filter with one IC whose dipole solution was located at the temporal area, in addition to the IC whose dipole was located at the frontal area, and then the silent "haru" spectrogram was estimated in the decoding phase. Figure 11 (A), (B) and (C) show the spectrograms obtained by the Kalman filter with only one IC, with the above two ICs, and with two ICs whose dipoles were localized to other areas, respectively. Figure 11 (B) revealed the best performance.

Practical problem
In practice, it is unknown during which silent task the EEGs were recorded. Therefore, all the spectrograms estimated by all the Kalman filters (KFs) from such EEGs were fed to all the HMMs, which output log-likelihoods. Table 3 shows a confusion matrix for one trial by one subject. This table indicates that the present method worked well. Even if EEGs can be recorded when a subject attempts to speak, speech signals might not be measurable. In this case, we could obtain the speech signal from a person physically fitted to the subject.

Figure 11: Spectrograms obtained from the Kalman filter (A) with only one IC, (B) with two ICs corresponding to Broca's and Wernicke's areas, and (C) with two ICs whose dipoles were located at other areas.

Table 3: A confusion matrix for the "spring"- and "summer"-HMMs with respect to log-likelihoods in the case of unknown silent speech EEGs.