

[Study Notes] How to Design a Spelling Corrector


Original post
liuxf666, posted 2019-5-15 10:59:13

Probability Theory

Problem: find the correction c, out of all possible candidate corrections, that maximizes the probability that c is the intended correction, given the original word w:

    argmax_{c ∈ candidates} P(c|w)

By Bayes' theorem this is equivalent to:

    argmax_{c ∈ candidates} P(c) P(w|c) / P(w)

Since P(w) is the same for every candidate c, we can factor it out, giving:

    argmax_{c ∈ candidates} P(c) P(w|c)


    The four parts of this expression are:
    • Selection Mechanism: argmax
      We choose the candidate with the highest combined probability.
    • Candidate Model: c ∈ candidates
      This tells us which candidate corrections, c, to consider.
    • Language Model: P(c)
      The probability that c appears as a word of English text. For example, occurrences of "the" make up about 7% of English text, so we should have P(the) = 0.07.
    • Error Model: P(w|c)
      The probability that w would be typed in a text when the author meant c. For example, P(teh|the) is relatively high, but P(theeexyz|the) would be very low.
One obvious question is: why take a simple expression like P(c|w) and replace it with a more complex expression involving two models rather than one? The answer is that P(c|w) is already conflating two factors, and it is easier to separate the two out and deal with them explicitly. Consider the misspelled word w="thew" and the two candidate corrections c="the" and c="thaw". Which has a higher P(c|w)? Well, "thaw" seems good because the only change is "a" to "e", which is a small change. On the other hand, "the" seems good because "the" is a very common word, and while adding a "w" seems like a larger, less probable change, perhaps the typist's finger slipped off the "e". The point is that to estimate P(c|w) we have to consider both the probability of c and the probability of the change from c to w anyway, so it is cleaner to formally separate the two factors.
How It Works in Python

The four parts of the program are:
    1. Selection Mechanism: In Python, max with a key argument does 'argmax'.
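As a minimal sketch of the idea (the probability values and candidate list below are invented for illustration):

```python
# Python's built-in max with a key function implements argmax:
# it returns the element for which key(element) is largest.
P = {'the': 0.07, 'thaw': 0.0001}.get  # toy probabilities, not real estimates

candidates = ['the', 'thaw']
best = max(candidates, key=P)  # argmax over candidates by probability
```
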


    2. Candidate Model: First a new concept: a simple edit to a word is a deletion (remove one letter), a transposition (swap two adjacent letters), a replacement (change one letter to another) or an insertion (add a letter). The function edits1 returns a set of all the edited strings (whether words or not) that can be made with one simple edit:
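The original code block did not survive in this copy; a sketch consistent with the description (one deletion, transposition, replacement, or insertion over the lowercase ASCII alphabet) might look like:

```python
def edits1(word):
    "All strings one simple edit away from word."
    letters    = 'abcdefghijklmnopqrstuvwxyz'
    # Every way to split word into a left and a right part.
    splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = [L + R[1:]               for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces   = [L + c + R[1:]           for L, R in splits if R for c in letters]
    inserts    = [L + c + R               for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)
```

For "the" (n = 3), the count works out to at most 54n + 25 = 187 strings before duplicates are removed.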


This can be a big set. For a word of length n, there will be n deletions, n-1 transpositions, 26n replacements, and 26(n+1) insertions, for a total of 54n+25 (of which a few are typically duplicates).
However, if we restrict ourselves to words that are known (that is, in the dictionary), then the set is much smaller:
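A sketch of the filtering step, with a toy dictionary standing in for the real word counts (the dictionary contents here are invented):

```python
# Filter candidate strings down to the ones present in the dictionary.
WORDS = {'the': 3, 'thaw': 1, 'moth': 1}  # toy stand-in for the real counts

def known(words):
    "The subset of words that appear in the dictionary WORDS."
    return set(w for w in words if w in WORDS)
```
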


    We'll also consider corrections that require two simple edits. This generates a much bigger set of possibilities, but usually only a few of them are known words.
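One way to sketch this is to apply a one-edit function to every result of a first application; a compact edits1 matching the description in step 2 is repeated here so the snippet runs on its own:

```python
def edits1(word):
    "All strings one simple edit away (deletes, transposes, replaces, inserts)."
    letters = 'abcdefghijklmnopqrstuvwxyz'
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    return set([L + R[1:] for L, R in splits if R] +
               [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1] +
               [L + c + R[1:] for L, R in splits if R for c in letters] +
               [L + c + R for L, R in splits for c in letters])

def edits2(word):
    "All strings two simple edits away from word."
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1))
```
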


    We say that the results of edits2(w) have an edit distance of 2 from w.
    3. Language Model: We can estimate the probability of a word, P(word), by counting the number of times each word appears in a text file of about a million words, big.txt. It is a concatenation of public domain book excerpts from Project Gutenberg and lists of most frequent words from Wiktionary and the British National Corpus. The function words breaks text into words, then the variable WORDS holds a Counter of how often each word appears, and P estimates the probability of each word, based on this Counter.
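A sketch with a small inline corpus standing in for big.txt (the corpus sentence and the exact tokenizing regex are assumptions for illustration):

```python
import re
from collections import Counter

def words(text):
    "Split text into lowercase words."
    return re.findall(r'[a-z]+', text.lower())

# In the real program this would be the contents of big.txt.
corpus = 'The thaw came and the river rose and the town watched'

WORDS = Counter(words(corpus))   # word -> occurrence count
N = sum(WORDS.values())          # total number of words in the corpus

def P(word):
    "Estimated probability of word, by relative frequency in the corpus."
    return WORDS[word] / N
```
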


    4. Error Model: When I started to write this program, sitting on a plane in 2007, I had no data on spelling errors, and no internet connection (I know that may be hard to imagine today). Without data I couldn't build a good spelling error model, so I took a shortcut: I defined a trivial, flawed error model that says all known words of edit distance 1 are infinitely more probable than known words of edit distance 2, and infinitely less probable than a known word of edit distance 0. So we can make candidates(word) produce the first non-empty list of candidates in order of priority:
    • The original word, if it is known; otherwise
    • The list of known words at edit distance one away, if there are any; otherwise
    • The list of known words at edit distance two away, if there are any; otherwise
    • The original word, even though it is not known.


      Then we don't need to multiply by a P(w|c) factor, because every candidate at the chosen priority will have the same probability (according to our flawed model). That gives us:
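Putting the pieces together, an end-to-end sketch under the same assumptions (toy corpus, trivial priority-based error model) could read:

```python
import re
from collections import Counter

def words(text): return re.findall(r'[a-z]+', text.lower())

# Toy corpus standing in for big.txt.
WORDS = Counter(words('the thaw saw the cat and the dog by the door'))

def P(word, N=sum(WORDS.values())):
    "Relative-frequency estimate of the probability of word."
    return WORDS[word] / N

def known(ws):
    "Subset of ws that appear in the dictionary."
    return set(w for w in ws if w in WORDS)

def edits1(word):
    "All strings one simple edit away from word."
    letters = 'abcdefghijklmnopqrstuvwxyz'
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    return set([L + R[1:] for L, R in splits if R] +
               [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1] +
               [L + c + R[1:] for L, R in splits if R for c in letters] +
               [L + c + R for L, R in splits for c in letters])

def edits2(word):
    "All strings two simple edits away from word."
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1))

def candidates(word):
    "First non-empty candidate set, in priority order."
    return (known([word]) or known(edits1(word)) or
            known(edits2(word)) or [word])

def correction(word):
    "Most probable spelling correction for word."
    return max(candidates(word), key=P)
```

Because every candidate produced at a given priority level shares the same (implicit) error probability, ranking by P alone suffices.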





Replies

Reply #2 — 經管之家編輯部 (staff, certified), 2019-5-15 11:32:51:
A like for you!

Reply #3 — 充實每一天, 2019-5-15 11:36:38 (from mobile):
Liked!

Reply #4 — tianwk, 2019-5-15 14:08:19:
Keep it up!

Reply #5 — jessie68us, 2019-5-15 23:27:52:
Selection Mechanism: argmax
Candidate Model: c ∈ candidates
Language Model: P(c)
Error Model: P(w|c)
These four models are classic. A like for you!

Reply #6 — 從1萬到一億 (certified), 2019-5-16 15:10:01
