Implementing voice input in a VR app

Background / what I want to achieve

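Good evening.
I am currently building a very simple walkthrough-style VR app for Android using Google Cardboard and Unity.
In it, I would like to implement a feature where "within a certain limited area, an object's animation changes according to a specific word spoken by the player,"
but I am embarrassed to say I have no idea how to go about it.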

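The flow inside the app would be:
"A collision between a collider and the player acts as the trigger that turns speech recognition on"
↓
"The player speaks a specific word (転がれ "roll", 跳べ "jump", etc.)"
↓
"The animation changes according to the spoken word"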

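Is the idea to convert the speech to text with a speech recognition API, compare that text in a switch statement, and change the animation in whichever case matches?
Even if so, I still do not understand how the API should actually be used to do the processing described above.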

I am aware that I am attempting something beyond my current ability, but I really want to finish this, so any help would be greatly appreciated.

Additional information (FW/tool versions, etc.)

Unity 2019.3.14f1

YAmaGNZ

2020/10/14 22:02

I think you need to break the processing down a little further and make clear exactly which part you do not understand.


1 answer

Best answer

> Is the idea to convert the speech to text with a speech recognition API, compare that text in a switch statement, and change the animation in whichever case matches?

That approach is fine.
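
As a rough illustration of that text-to-animation lookup, here is a minimal sketch in plain C#. The class name `VoiceCommands` and the trigger names `"Roll"` and `"Jump"` are hypothetical, and the punctuation trimming assumes the recognition service may append a full stop to the returned text.

```csharp
public static class VoiceCommands
{
    // Returns a hypothetical Animator trigger name for the recognized text,
    // or null when the text is not one of the known commands.
    public static string ToTrigger(string recognizedText)
    {
        if (recognizedText == null) return null;

        // Assumption: the service may add punctuation such as "。", so trim it first.
        string word = recognizedText.Trim().TrimEnd('。', '.', '！', '!', '？', '?');

        switch (word)
        {
            case "転がれ": return "Roll"; // assumed trigger name
            case "跳べ": return "Jump";   // assumed trigger name
            default: return null;
        }
    }
}
```

In Unity, the returned name could then be passed to `Animator.SetTrigger` on the main thread, provided the Animator Controller defines triggers with those names.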

> I still do not understand how the API should actually be used to do the processing described above

The speech recognition APIs that can be used from Unity include the following:
・UnityEngine.Windows.Speech
・Azure Cognitive Services - Speech Service
・Google Cloud Speech API
・IBM Watson Speech to Text
・Amazon Transcribe

UnityEngine.Windows.Speech is the easiest way to implement speech recognition, but it only works on Windows, so it is not an option this time.

For Azure Cognitive Services - Speech Service, I found a Qiita article that looks good, so I suggest trying it. (It can be used on Android.)

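Qiita | UnityでMicrosoft Cognitive Speech Servicesによる音声認識をAndroidスマホで実装する (Implementing speech recognition on an Android phone with Microsoft Cognitive Speech Services in Unity)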
https://qiita.com/tomato_sugar/items/ac4b5dbe8277496add56

If you get stuck along the way, let me know. I should be able to give advice.

Addendum: how to set up the Azure Cognitive Speech SDK

This assumes you have already created a Cognitive Services resource on Azure under your own account.
See the Qiita article above for how to create one.

1, git clone Microsoft's official Unity sample project.
https://github.com/Azure-Samples/cognitive-services-speech-sdk/tree/master/quickstart/csharp/unity/from-microphone
※Alternatively, download the repository below as a zip and use quickstart/csharp/unity/from-microphone.
https://github.com/Azure-Samples/cognitive-services-speech-sdk

2, Download Microsoft.CognitiveServices.Speech.xxx.unitypackage and import it.
https://aka.ms/csspeech/unitypackage

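3, Replace the contents of Assets/Scripts/HelloWorld.cs with the code below.
The changes: the sample had the key and related settings written directly in the script, so I made them enterable from the Editor with [SerializeField]. I also added Japanese support.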

C#
//
// Copyright (c) Microsoft. All rights reserved.
// Licensed under the MIT license. See LICENSE.md file in the project root for full license information.
//
// <code>
using UnityEngine;
using UnityEngine.UI;
using Microsoft.CognitiveServices.Speech;
#if PLATFORM_ANDROID
using UnityEngine.Android;
#endif
#if PLATFORM_IOS
using UnityEngine.iOS;
using System.Collections;
#endif

public class HelloWorld : MonoBehaviour
{
    // Hook up the two properties below with a Text and Button object in your UI.
    public Text outputText;
    public Button startRecoButton;

    private object threadLocker = new object();
    private bool waitingForReco;
    private string message;

    private bool micPermissionGranted = false;

#if PLATFORM_ANDROID || PLATFORM_IOS
    // Required to manifest microphone permission, cf.
    // https://docs.unity3d.com/Manual/android-manifest.html
    private Microphone mic;
#endif

    [SerializeField] string subscriptionKey = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx";
    [SerializeField] string serviceRegion = "japaneast";

    public async void ButtonClick()
    {
        // Creates an instance of a speech config with specified subscription key and service region.
        // Replace with your own subscription key and service region (e.g., "westus").
        var config = SpeechConfig.FromSubscription(subscriptionKey, serviceRegion);
        var lang = SourceLanguageConfig.FromLanguage("ja-JP");

        // Make sure to dispose the recognizer after use!
        using (var recognizer = new SpeechRecognizer(config, lang))
        {
            lock (threadLocker)
            {
                waitingForReco = true;
            }

            // Starts speech recognition, and returns after a single utterance is recognized. The end of a
            // single utterance is determined by listening for silence at the end or until a maximum of 15
            // seconds of audio is processed.  The task returns the recognition text as result.
            // Note: Since RecognizeOnceAsync() returns only a single utterance, it is suitable only for single
            // shot recognition like command or query.
            // For long-running multi-utterance recognition, use StartContinuousRecognitionAsync() instead.
            var result = await recognizer.RecognizeOnceAsync().ConfigureAwait(false);

            // Checks result.
            string newMessage = string.Empty;
            if (result.Reason == ResultReason.RecognizedSpeech)
            {
                newMessage = result.Text;
            }
            else if (result.Reason == ResultReason.NoMatch)
            {
                newMessage = "NOMATCH: Speech could not be recognized.";
            }
            else if (result.Reason == ResultReason.Canceled)
            {
                var cancellation = CancellationDetails.FromResult(result);
                newMessage = $"CANCELED: Reason={cancellation.Reason} ErrorDetails={cancellation.ErrorDetails}";
            }

            lock (threadLocker)
            {
                message = newMessage;
                waitingForReco = false;
            }
        }
    }

    void Start()
    {
        if (outputText == null)
        {
            UnityEngine.Debug.LogError("outputText property is null! Assign a UI Text element to it.");
        }
        else if (startRecoButton == null)
        {
            message = "startRecoButton property is null! Assign a UI Button to it.";
            UnityEngine.Debug.LogError(message);
        }
        else
        {
            // Continue with normal initialization, Text and Button objects are present.
#if PLATFORM_ANDROID
            // Request to use the microphone, cf.
            // https://docs.unity3d.com/Manual/android-RequestingPermissions.html
            message = "Waiting for mic permission";
            if (!Permission.HasUserAuthorizedPermission(Permission.Microphone))
            {
                Permission.RequestUserPermission(Permission.Microphone);
            }
#elif PLATFORM_IOS
            if (!Application.HasUserAuthorization(UserAuthorization.Microphone))
            {
                Application.RequestUserAuthorization(UserAuthorization.Microphone);
            }
#else
            micPermissionGranted = true;
            message = "Click button to recognize speech";
#endif
            startRecoButton.onClick.AddListener(ButtonClick);
        }
    }

    void Update()
    {
#if PLATFORM_ANDROID
        if (!micPermissionGranted && Permission.HasUserAuthorizedPermission(Permission.Microphone))
        {
            micPermissionGranted = true;
            message = "Click button to recognize speech";
        }
#elif PLATFORM_IOS
        if (!micPermissionGranted && Application.HasUserAuthorization(UserAuthorization.Microphone))
        {
            micPermissionGranted = true;
            message = "Click button to recognize speech";
        }
#endif

        lock (threadLocker)
        {
            if (startRecoButton != null)
            {
                startRecoButton.interactable = !waitingForReco && micPermissionGranted;
            }
            if (outputText != null)
            {
                outputText.text = message;
            }
        }
    }
}
// </code>

4, From the resource you created, paste Key 1 into subscriptionKey. The Location corresponds to serviceRegion, so change it if yours is not japaneast.

5, In Build Settings, change the Target Platform to Android and build, then check that it works on a real device. (Windows and iOS are fine too.)
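
To connect this to the flow in the question (recognition only counts while the player is inside a trigger collider), something like the following might work. This is only a sketch under several assumptions: the component name `VoiceCommandZone`, the `"Player"` tag, and the `Submit` method are all hypothetical, the zone's collider has "Is Trigger" enabled, and the player object has a collider plus a Rigidbody or CharacterController so that trigger events fire. Because the sample awaits with `ConfigureAwait(false)`, results can arrive off the main thread, so they are queued and applied in `Update()`.

```csharp
using System.Collections.Concurrent;
using UnityEngine;

// Hypothetical component: attach to the GameObject that holds the trigger collider.
public class VoiceCommandZone : MonoBehaviour
{
    [SerializeField] Animator targetAnimator;     // Animator of the object to control
    [SerializeField] string playerTag = "Player"; // assumed tag on the player object

    // Trigger names submitted from any thread, consumed on Unity's main thread.
    readonly ConcurrentQueue<string> pendingTriggers = new ConcurrentQueue<string>();
    volatile bool playerInside;

    // True while the player is inside the zone; recognition could be started only then.
    public bool PlayerInside => playerInside;

    // Safe to call from a worker thread. Ignored while the player is outside the zone.
    public void Submit(string animatorTriggerName)
    {
        if (playerInside && !string.IsNullOrEmpty(animatorTriggerName))
        {
            pendingTriggers.Enqueue(animatorTriggerName);
        }
    }

    void OnTriggerEnter(Collider other)
    {
        if (other.CompareTag(playerTag)) playerInside = true;
    }

    void OnTriggerExit(Collider other)
    {
        if (other.CompareTag(playerTag)) playerInside = false;
    }

    void Update()
    {
        // Animator must be touched on the main thread, so the queue is drained here.
        while (pendingTriggers.TryDequeue(out var triggerName))
        {
            targetAnimator.SetTrigger(triggerName);
        }
    }
}
```

In HelloWorld.cs, the branch that assigns `newMessage = result.Text;` would be a natural place to decide which trigger name the recognized word corresponds to and pass it to `Submit`.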

Posted 2020/10/17 12:45

Edited 2020/10/19 16:42

u824


fiveHundred

2020/10/17 13:27

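I think the overall approach is roughly right, but you may need to account for inflected forms such as 跳べ (tobe, the imperative "jump") versus 跳ぶ (tobu, the dictionary form), homophones such as 飛べ (tobe, "fly"), misrecognition of similar-sounding words, and utterances that contain other words, such as 蛙よ跳べ ("Frog, jump"). So I suspect this is not a problem simple enough to solve with a switch statement alone. That said, it is probably fine to start with something simple and improve it from there. Also, Android and iOS should have native voice input features, so using those is another option. The barrier to entry is higher, I think, but a cloud API can end up costing money in some cases.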

u824

2020/10/17 14:06

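> you may need to account for inflected forms such as 跳べ versus 跳ぶ, homophones such as 飛べ, misrecognition of similar-sounding words, and utterances that contain other words, such as 蛙よ跳べ
That is exactly right. I recommend keeping the command words as simple as possible. If certain misrecognized words come up frequently, it would be good to add them to the conditional branches as well.

> Also, Android and iOS should have native voice input features, so using those is another option.
That made me curious, so I looked into it and found something that looks promising. I have not verified that it works, but if it is usable, this seems like the better choice.

BOOTH | FantomPlugin v1.18 (Unity × Android)
https://booth.pm/ja/items/1556439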
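
The matching concerns raised in these comments (inflected forms, homophones, and extra words around the command) could be handled with a small variant table instead of exact comparison. The sketch below is plain C#; the class name `LooseCommandMatcher`, the command names, and the particular variant lists are all hypothetical and would need tuning against what the recognizer actually returns.

```csharp
using System;
using System.Collections.Generic;

public static class LooseCommandMatcher
{
    // Hypothetical table: each command lists the spellings that should count as it,
    // including a homophone (飛べ), dictionary forms (跳ぶ, 転がる) and kana spellings.
    static readonly KeyValuePair<string, string[]>[] Table =
    {
        new KeyValuePair<string, string[]>("Jump", new[] { "跳べ", "飛べ", "とべ", "跳ぶ", "飛ぶ" }),
        new KeyValuePair<string, string[]>("Roll", new[] { "転がれ", "ころがれ", "転がる" }),
    };

    // Returns the first command whose variant occurs anywhere in the text, or null.
    public static string Match(string recognizedText)
    {
        if (string.IsNullOrEmpty(recognizedText)) return null;

        foreach (var entry in Table)
        {
            foreach (var variant in entry.Value)
            {
                // Substring search, so surrounding words and punctuation are tolerated.
                if (recognizedText.IndexOf(variant, StringComparison.Ordinal) >= 0)
                {
                    return entry.Key;
                }
            }
        }
        return null;
    }
}
```

Substring matching is deliberately loose, so it can also fire on unintended phrases that merely contain a variant (for example a negated form); short, distinct command words keep that risk lower.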