I want to fix garbled characters (mojibake) in PHP

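Background / what I want to achieve

Using PHP's file_get_contents, I want to fetch the following from a given site:
・Title
・Description
・Image (the first image on the page if there is no OGP image)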

Problem

The text becomes garbled at this line of the source code below:
$html = file_get_contents( $url );
and I want to fix that.

In the $data that is finally built from $html, title and excerpt stay garbled and I cannot fix them.

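Source code

First, here is the source code with no countermeasures applied.
The comment blocks marked "Mojibake fix 1" and "Mojibake fix 2" are where I inserted the attempts described later under "What I tried".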

php
<?php
$url = "https://www.gincli.jp/news/index.html"; // works fine
$url = "http://affiliate.rakuten.co.jp/trend//?l-id=top_keyword_sonota"; // comes back garbled
var_export( myGet( $url ) );

function myGet( $url ){
	$data = array(); // initialized so the return below is always defined
	$html = file_get_contents( $url );

	if ($html <> '') {

		/******************* "Mojibake fix 1" starts here */

		// Tried writing the fix here, but it did not work

		/* ends here **************************************/

		// HEAD tag (META tag parsing)
		$head = null;
		$tags = null;
		if (preg_match('/<\s*head[^>]*>(.*)<\s*\/head\s*>/si', $html, $m)) {
			$head = $m[1];
			$tags = GetMeta($head);
			var_dump($tags);
		}

		// Title
		if (isset( $tags['og:title'] ) && $tags['og:title'] ) {
			$title = $tags['og:title'] ;
		} elseif (isset( $tags['twitter:title'] ) && $tags['twitter:title'] ) {
			$title = $tags['twitter:title'] ;
		} elseif (isset( $tags['title'] ) && $tags['title'] ) {
			$title = $tags['title'] ;
		}

		// Excerpt / description
		if (isset( $tags['og:description'] ) && $tags['og:description'] ) {
			$excerpt = $tags['og:description'] ;
		} elseif (isset( $tags['twitter:description'] ) && $tags['twitter:description'] ) {
			$excerpt = $tags['twitter:description'] ;
		} elseif (isset( $tags['description'] ) && $tags['description'] ) {
			$excerpt = $tags['description'] ;
		}

		// Image URL from OGP, falling back to the first <img> on the page
		if (isset( $tags['og:image'] ) && $tags['og:image'] ) {
			$thumbnail_url = $tags['og:image'] ;
		} elseif (isset( $tags['twitter:image'] ) && $tags['twitter:image'] ) {
			$thumbnail_url = $tags['twitter:image'] ;
		} else {
			preg_match_all('/<img.+src=[\'"]([^\'"]+)[\'"].*>/i', $html, $matches);
			$first_img = $matches[1][0] ?? null; // null when the page has no <img>
			$thumbnail_url = $first_img;
		}
		if ($thumbnail_url && !preg_match('/^https?:\/\//', $thumbnail_url, $m) ) {
			$thumbnail_url = RelToURL($url, $thumbnail_url);
		}

		// Site name
		if (isset( $tags['og:site_name'] ) && $tags['og:site_name'] ) {
			$site_name = $tags['og:site_name'] ;
		}

		// Clean up the title
		if (isset($title)) {
			$str   = $title;
			$str   = strip_tags($str);                          // remove tags
			$str   = str_replace(array("\r", "\n"), '', $str);  // remove line breaks
			$str   = mb_strimwidth($str, 0, 200, '...');        // stored title is cut at width 200
			$title = $str;
		}

		// Clean up the excerpt
		if (isset($excerpt)) {
			$str     = $excerpt;
			$str     = strip_tags($str);                          // remove tags
			$str     = str_replace(array("\r", "\n"), '', $str);  // remove line breaks
			$str     = mb_strimwidth($str, 0, 500, '...');        // stored excerpt is cut at width 500
			$excerpt = $str;
		}

		/******************* "Mojibake fix 2" starts here */

		// Tried writing the fix here, but it did not work

		/* ends here **************************************/

		// Build the result
		if (isset($data_id) && !is_null($data_id)) {
			$data['id'] = $data_id;
		}
		if (isset($url_key) && !is_null($url_key)) {
			$data['url_key'] = $url_key;
		}
		$data['site_name']     = $site_name     ?? 'error';
		$data['title']         = $title         ?? 'error';
		$data['excerpt']       = $excerpt       ?? 'error';
		$data['charset']       = $charset       ?? 'error';
		$data['thumbnail_url'] = $thumbnail_url ?? 'error';
	}

	return $data;
}

// Split out the TITLE and META tags
function GetMeta( $html, $tags = null, $clear = false ) {
	if ($clear == true || !isset($tags)) {
		$tags = null;
		$tags = array('none' => 'none');
	}

	// TITLE tag
	if (preg_match('/<\s*title\s*[^>]*>\s*([^<]*)\s*<\s*\/title\s*[^>]*>/si', $html, $m)) {
		//$tags['title'] = esc_html($m[1]);
		$tags['title'] = $m[1];
	}

	// Parse meta tags
	$match = null;
	preg_match_all('/<\s*meta\s(?=[^>]*?\b(?:name|property)\s*=\s*(?|"\s*([^"]*?)\s*"|\'\s*([^\']*?)\s*\'|([^"\'>]*?)(?=\s*\/?\s*>|\s\w+\s*=)))[^>]*?\bcontent\s*=\s*(?|"\s*([^"]*?)\s*"|\'\s*([^\']*?)\s*\'|([^"\'>]*?)(?=\s*\/?\s*>|\s\w+\s*=))[^>]*>/is', $html, $match);
	if (isset($match) && is_array($match) && count($match) == 3 && count($match[1]) > 0) {
		foreach ($match[1] as &$m) {
			$m = strtolower($m);
		}
		unset($m);
		$tags += array_combine($match[1], $match[2]);
	}

	// Parse link tags
	$match = null;
	preg_match_all('/<\s*link\s(?=[^>]*?\brel\s*=\s*(?|"\s*([^"]*?)\s*"|\'\s*([^\']*?)\s*\'|([^"\'>]*?)(?=\s*\/?\s*>|\s\w+\s*=)))[^>]*?\bhref\s*=\s*(?|"\s*([^"]*?)\s*"|\'\s*([^\']*?)\s*\'|([^"\'>]*?)(?=\s*\/?\s*>|\s\w+\s*=))[^>]*>/is', $html, $match);
	if (isset($match) && is_array($match) && count($match) == 3 && count($match[1]) > 0) {
		foreach ($match[1] as &$m) {
			$m = strtolower($m);
		}
		unset($m);
		$tags += array_combine($match[1], $match[2]);
	}

	return $tags;
}

// Turn a relative path into a URL
function RelToURL( $base_url = '', $rel_path = '' ) {
	if (preg_match('/^https?\:\/\//', $rel_path ) ) {   // already an absolute URL
		return $rel_path;
	} elseif (substr($rel_path, 0, 2) == '//' ) {       // absolute URL with the scheme omitted
		return $rel_path;
	}
	$parse = parse_url($base_url );
	if (substr($rel_path, 0, 1) == '/' ) {              // path from the document root
		return $parse['scheme'].'://'.$parse['host'].$rel_path;
	}
	return $parse['scheme'].'://'.$parse['host'].dirname($parse['path'] ).'/'.$rel_path;
}
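
A side note on the file_get_contents call in the code above: when it fetches an http:// or https:// URL, PHP fills the local variable $http_response_header with the raw response header lines, and the Content-Type line often names the page's charset. The sketch below shows one possible way to read it. The function name charsetFromHeaders and the sample header lines are made up for illustration, and whether the site in question actually sends a charset this way is an assumption that would need checking.

```php
<?php
/**
 * Hypothetical helper (not part of the original post): pick the charset out
 * of raw HTTP response header lines, such as the array PHP stores in
 * $http_response_header after file_get_contents() on an HTTP URL.
 * Returns null when no Content-Type header carries a charset parameter.
 */
function charsetFromHeaders(array $headers): ?string
{
    $charset = null;
    foreach ($headers as $line) {
        // Keep the last match so that, after redirects, the final response wins.
        if (preg_match('/^Content-Type:.*?charset\s*=\s*"?([\w.:-]+)/i', $line, $m)) {
            $charset = $m[1];
        }
    }
    return $charset;
}

// Sample header lines, invented for this example.
$sample = array(
    'HTTP/1.1 200 OK',
    'Content-Type: text/html; charset=EUC-JP',
);
var_dump(charsetFromHeaders($sample));
```

In myGet this could be called as charsetFromHeaders($http_response_header ?? array()) right after file_get_contents. A charset found here would still have to be passed to mb_convert_encoding as the source encoding for anything to change.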

What I tried

First, I tried "Mojibake fix 1" on the whole of $html with the code below.
I placed it in the corresponding comment block above as follows, but it did not solve the problem.

php
		/******************* "Mojibake fix 1" starts here */

		$charset = null;
		$detects = array('UTF-8','SJIS','EUC-JP','eucJP-win','ASCII','JIS','SJIS-win');
		if (preg_match('/charset\s*=\s*"*([^>\/\s"]*).*<\/head/si', $html, $m)) {
			$m[1] = trim(trim($m[1]), '\'\"');
			$charset = $m[1];
		} else {
			foreach( $detects as $c_charset) {
				// Convert, and treat the encoding that leaves the content unchanged as the charset
				if (mb_convert_encoding($html, $charset, $c_charset) == $html) {
					$charset = $c_charset;
					break;
				}
			}
		}
		if (is_null($charset)) {
			$charset = mb_detect_encoding($html, 'ASCII,JIS,UTF-7,EUC-JP,SJIS,UTF-8');
			$html = mb_convert_encoding($html, $charset, 'ASCII,JIS,UTF-7,EUC-JP,SJIS,UTF-8');
		} elseif ($charset <> $charset) {
			$html = mb_convert_encoding($html, $charset, $charset);
		}

		/* ends here **************************************/
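
One thing that stands out in this attempt: the final check compares $charset with itself, so that branch can never run, and none of the mb_convert_encoding calls names UTF-8 as the target. A minimal sketch of the usual shape (find the source charset, then convert from it to UTF-8) might look like the following. The function name toUtf8 and the label-to-encoding table are assumptions made for illustration. The sketch also assumes that the mbstring extension is loaded, that this file itself is saved as UTF-8, and that the page declares its charset in a meta tag or can be guessed from its bytes.

```php
<?php
/**
 * Hypothetical helper (not from the original post): return $html as UTF-8.
 */
function toUtf8(string $html): string
{
    // Assumed mapping from charset labels found in HTML to mbstring names.
    $known = array(
        'utf-8'       => 'UTF-8',
        'shift_jis'   => 'SJIS-win',
        'sjis'        => 'SJIS-win',
        'x-sjis'      => 'SJIS-win',
        'windows-31j' => 'SJIS-win',
        'euc-jp'      => 'eucJP-win',
        'x-euc-jp'    => 'eucJP-win',
        'iso-2022-jp' => 'ISO-2022-JP',
    );

    $from = null;

    // 1. Prefer a charset declared in a <meta> tag.
    if (preg_match('/<meta[^>]+charset\s*=\s*["\']?\s*([\w.:-]+)/i', $html, $m)) {
        $from = $known[strtolower($m[1])] ?? null;
    }

    // 2. Otherwise guess from the bytes, in strict mode.
    if ($from === null) {
        $guess = mb_detect_encoding($html, array('UTF-8', 'eucJP-win', 'SJIS-win'), true);
        $from  = ($guess === false) ? null : $guess;
    }

    // 3. Convert only when the source is known and is not already UTF-8.
    if ($from === null || $from === 'UTF-8') {
        return $html;
    }
    return mb_convert_encoding($html, 'UTF-8', $from);
}

// Usage: build an EUC-JP page in memory and convert it back.
$eucHtml = mb_convert_encoding('<meta charset="EUC-JP"><title>日本語</title>', 'EUC-JP', 'UTF-8');
echo toUtf8($eucHtml), "\n";
```

Called as $html = toUtf8($html); at the "Mojibake fix 1" position, this would hand the later regular expressions UTF-8 text. It would not help if the garbling has another cause, for example a compressed response body.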

Then, thinking I might at least fix $title alone rather than the whole of $html, I tried "Mojibake fix 2" below. It did not work either, so I am asking here.

php
		/******************* "Mojibake fix 2" starts here */

		$title_check = utf8_decode($title);
		if(mb_detect_encoding($title_check) == 'UTF-8'){
			$title = $title_check; // mojibake resolved
		}
		// If a character encoding other than UTF-8 was passed in, convert it to UTF-8
		if(mb_detect_encoding($title) != 'UTF-8'){
			$title = mb_convert_encoding($title, 'UTF-8', mb_detect_encoding($title, $detects, true));
		}

		/* ends here **************************************/
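
Two observations on this second attempt. utf8_decode converts from UTF-8 to ISO-8859-1 (and is deprecated as of PHP 8.2), so it is not suited to Japanese text. $detects is only defined inside the first attempt, so it is undefined here if this block is used on its own. A sketch of a per-string check built on mb_check_encoding is shown below. The function name ensureUtf8 and its candidate list are hypothetical, and it assumes that mbstring is loaded and that this file is saved as UTF-8.

```php
<?php
/**
 * Hypothetical helper (not from the original post): make sure one string,
 * such as a title, is UTF-8. $candidates is an assumed list of source
 * encodings to try, in order.
 */
function ensureUtf8(string $text, array $candidates = array('eucJP-win', 'SJIS-win')): string
{
    // Valid UTF-8 (which includes plain ASCII) is returned untouched.
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text;
    }
    // Otherwise convert from the first candidate the bytes are valid in.
    foreach ($candidates as $encoding) {
        if (mb_check_encoding($text, $encoding)) {
            return mb_convert_encoding($text, 'UTF-8', $encoding);
        }
    }
    // Unknown encoding: leave the bytes as they are.
    return $text;
}

// Usage: an EUC-JP encoded title is converted, a UTF-8 one is left alone.
$eucTitle = mb_convert_encoding('日本語のタイトル', 'EUC-JP', 'UTF-8');
echo ensureUtf8($eucTitle), "\n";
echo ensureUtf8('日本語のタイトル'), "\n";
```

Byte-level validity cannot always tell EUC-JP from Shift_JIS on short strings, so a charset declared by the page or by the HTTP response is generally the more reliable source when one is available.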

If anyone knows how to fix the garbled title and excerpt, I would appreciate your advice.
I would also welcome pointers on any other oversights you notice in the code.


1 answer

https://teratail.com/questions/205105

Posted 2020/06/20 06:23

kai0310

Total score: 2070

HARDTHINGS

2020/06/20 06:48

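Thank you for taking the time, but I am afraid I could not grasp your intent. I have already attempted the kind of conversion shown at the linked page using mb_convert_encoding, so if you have spotted a flaw in how I used it, I would be grateful if you could point it out a little more specifically. If you simply pasted the link without looking at what I tried in the question, there is no need to reply, so please do not worry about it.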