*******************
Text_LanguageDetect
*******************
PHP library that identifies the human language of a text sample.
It can also return a confidence score for each candidate language.


Installation
============

PEAR
----
::

    $ pear install Text_LanguageDetect

Composer
--------
::

    $ composer require pear/text_languagedetect


Usage
=====
Also see the examples in the ``docs/`` directory and
the `official documentation`__.

__ http://pear.php.net/package/Text_LanguageDetect/docs

Language detection
------------------
Simple language detection::

    <?php
    require_once 'Text/LanguageDetect.php';

    $text = 'Was wäre, wenn ich Ihnen das jetzt sagen würde?';

    $ld = new Text_LanguageDetect();
    $language = $ld->detectSimple($text);

    echo $language;
    //output: german

Show the three most probable languages with their confidence scores::

    <?php
    require_once 'Text/LanguageDetect.php';

    $text = 'Was wäre, wenn ich Ihnen das jetzt sagen würde?';

    $ld = new Text_LanguageDetect();
    //3 most probable languages
    $results = $ld->detect($text, 3);

    foreach ($results as $language => $confidence) {
        echo $language . ': ' . number_format($confidence, 2) . "\n";
    }

    //output:
    //german: 0.35
    //dutch: 0.25
    //swedish: 0.20
    ?>
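
The array returned by ``detect()`` in the example above maps language names
to scores, so accepting a result only when it is clearly ahead can be done in
plain PHP. The following is a minimal sketch and not part of the library:
``pickLanguage()`` and its ``$minScore`` and ``$minLead`` parameters are
hypothetical names, and the default thresholds are arbitrary assumptions that
would need tuning for real input::

    <?php
    /**
     * Hypothetical helper, not provided by Text_LanguageDetect.
     * Returns the best language name, or null if the result is too weak
     * or too close to the runner-up.
     */
    function pickLanguage(array $results, $minScore = 0.3, $minLead = 0.05)
    {
        if (count($results) === 0) {
            return null;
        }

        //sort by score, highest first, keeping the language names as keys
        arsort($results);
        $languages = array_keys($results);
        $scores    = array_values($results);

        //reject a weak best match
        if ($scores[0] < $minScore) {
            return null;
        }
        //reject a best match that barely beats the second candidate
        if (isset($scores[1]) && $scores[0] - $scores[1] < $minLead) {
            return null;
        }
        return $languages[0];
    }

    //sample data shaped like the output shown above
    $results = array('german' => 0.35, 'dutch' => 0.25, 'swedish' => 0.20);

    echo pickLanguage($results) . "\n";
    //output: german
    ?>

With a stricter ``$minScore`` such as ``0.5``, the same sample data yields
``null``, which lets the caller fall back to a default language.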


Language code
-------------
Instead of returning the full language name, ISO 639-1 two-letter or
ISO 639-2 three-letter codes can be returned::

    <?php
    require_once 'Text/LanguageDetect.php';
    $ld = new Text_LanguageDetect();

    //will output the ISO 639-1 two-letter language code
    // "de"
    $ld->setNameMode(2);
    echo $ld->detectSimple('Das ist ein kleiner Text') . "\n";

    //will output the ISO 639-2 three-letter language code
    // "deu"
    $ld->setNameMode(3);
    echo $ld->detectSimple('Das ist ein kleiner Text') . "\n";
    ?>


Supported languages
===================
- albanian
- arabic
- azeri
- bengali
- bulgarian
- cebuano
- croatian
- czech
- danish
- dutch
- english
- estonian
- farsi
- finnish
- french
- german
- hausa
- hawaiian
- hindi
- hungarian
- icelandic
- indonesian
- italian
- kazakh
- kyrgyz
- latin
- latvian
- lithuanian
- macedonian
- mongolian
- nepali
- norwegian
- pashto
- pidgin
- polish
- portuguese
- romanian
- russian
- serbian
- slovak
- slovene
- somali
- spanish
- swahili
- swedish
- tagalog
- turkish
- ukrainian
- urdu
- uzbek
- vietnamese
- welsh


Links
=====
Homepage
  http://pear.php.net/package/Text_LanguageDetect
Bug tracker
  http://pear.php.net/bugs/search.php?cmd=display&package_name[]=Text_LanguageDetect
Documentation
  http://pear.php.net/package/Text_LanguageDetect/docs
Unit test status
  https://travis-ci.org/pear/Text_LanguageDetect

  .. image:: https://travis-ci.org/pear/Text_LanguageDetect.svg?branch=master
     :target: https://travis-ci.org/pear/Text_LanguageDetect


Notes
=====
Where does the data come from?

 I don't recall where I got the original data set.
 It's just the frequencies of three-letter combinations (trigrams) in each supported language.
 It could be generated from a few random Wikipedia pages in each language.
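
To illustrate what "frequencies of three-letter combinations" means, the
sketch below counts the trigrams in a text sample. It is a simplified
assumption about how such a data set could be built, not the code that
produced the data shipped with this package. ``countTrigrams()`` is a
hypothetical name, the lower-casing and whitespace handling are choices made
for this example, and the ``mbstring`` extension is assumed to be available::

    <?php
    /**
     * Hypothetical helper: counts every three-character sequence in $text.
     * Returns an array of trigram => count, most frequent first.
     */
    function countTrigrams($text)
    {
        //normalize: lower-case, collapse runs of whitespace to one space
        $text = mb_strtolower($text, 'UTF-8');
        $text = preg_replace('/\s+/u', ' ', trim($text));

        //split into UTF-8 characters, so "ä" counts as one letter
        $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
        $total = count($chars);

        $counts = array();
        for ($i = 0; $i + 2 < $total; $i++) {
            $trigram = $chars[$i] . $chars[$i + 1] . $chars[$i + 2];
            if (isset($counts[$trigram])) {
                $counts[$trigram]++;
            } else {
                $counts[$trigram] = 1;
            }
        }

        //most frequent trigrams first
        arsort($counts);
        return $counts;
    }

    $counts = countTrigrams('banana');
    echo $counts['ana'] . "\n";
    //output: 2
    ?>

In ``banana`` the trigram ``ana`` occurs twice, while ``ban`` and ``nan``
occur once each. Running such a count over a larger sample per language would
give a frequency table of the kind described above.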