Automatic Transliteration of Romanized Dialectal Arabic


Mohamed Al-Badrashiny†, Ramy Eskander, Nizar Habash and Owen Rambow
†Department of Computer Science, The George Washington University, Washington, DC
Center for Computational Learning Systems, Columbia University, NYC

Proceedings of the Eighteenth Conference on Computational Language Learning, pages 30–38, Baltimore, Maryland USA, June 26-27 2014. © 2014 Association for Computational Linguistics

Abstract

In this paper, we address the problem of converting Dialectal Arabic (DA) text that is written in the Latin script (called Arabizi) into Arabic script following the CODA convention for DA orthography. The presented system uses a finite-state transducer trained at the character level to generate all possible transliterations for the input Arabizi words. We then filter the generated list using a DA morphological analyzer. After that, we pick the best choice for each input word using a language model. We achieve an accuracy of 69.4% on an unseen test set, compared to 63.1% using a system which represents a previously proposed approach.

1 Introduction

The Arabic language is a collection of varieties: Modern Standard Arabic (MSA), which is used in formal settings and has a standard orthography, and different forms of Dialectal Arabic (DA), which are commonly used informally and with increasing presence on the web, but which do not have standard orthographies. While both MSA and DA are commonly written in the Arabic script, DA (and less so MSA) is sometimes written in the Latin script. This happens when using an Arabic keyboard is dispreferred or impossible, for example when communicating from a mobile phone that has no Arabic script support. Arabic written in the Latin script is often referred to as “Arabizi”.

Arabizi is not a letter-based transliteration from the Arabic script as is, for example, the Buckwalter transliteration (Buckwalter, 2004). Instead, roughly speaking, writers use sound-to-letter rules inspired by those of English[1] as well as informally established conventions to render the sounds of the DA sentence. Because the sound-to-letter rules of English are very different from those of Arabic, we obtain complex mappings between the two writing systems. This issue is compounded by the underlying problem that DA itself does not have any standard orthography in the Arabic script. Table 1 shows different plausible ways of writing an Egyptian Arabic (EGY) sentence in Arabizi and in Arabic script.

[1] In different parts of the Arab World, the basis for the Latin script rendering of DA may come from different languages that natively use the Latin script, such as English or French. In this paper, we concentrate on Egyptian Arabic, which uses English as its main source of sound-to-letter rules.

Arabizi poses a problem for natural language processing (NLP). While some tools have recently become available for processing EGY input, e.g., (Habash et al., 2012b; Habash et al., 2013; Pasha et al., 2014), they expect Arabic script input (or a Buckwalter transliteration). They cannot process Arabizi. We therefore need a tool that converts from Arabizi to Arabic script. However, the lack of standard orthography in EGY compounds the problem: what should we convert Arabizi into? Our answer to this question is to use CODA, a conventional orthography created for the purpose of supporting NLP tools (Habash et al., 2012a). The goal of CODA is to reduce the data sparseness that comes from the same word form appearing in many spontaneous orthographies in data (be it annotated or unannotated). CODA has been defined for EGY as well as Tunisian Arabic (Zribi et al., 2014), and it has been used as part of different approaches for modeling DA morphology (Habash et al., 2012b), tagging (Habash et al., 2013; Pasha et al., 2014) and spelling correction (Eskander et al., 2013; Farra et al., 2014).

This paper makes two main contributions. First, we clearly define the computational problem of transforming Arabizi to CODA. This improves over previous work by unambiguously fixing the

target representation for the transformation. Second, we perform experiments using different components in a transformation pipeline, and show that a combination of character-based transduction, filtering using a morphological analyzer, and using a language model outperforms other architectures, including the state-of-the-art system described in Darwish (2013). Darwish (2013) presented a conversion tool, but did not discuss conversion into a conventionalized orthography, and did not investigate different architectures. We show in this paper that our proposed architecture, which includes an EGY morphological analyzer, improves over Darwish’s architecture.

This paper is structured as follows. We start out by presenting relevant linguistic facts (Section 2) and then discuss related work (Section 3). We present our approach in Section 4 and our experiments and results in Section 5.

2 Linguistic Facts

2.1 EGY Spontaneous Orthography

An orthography is a specification of how to use a particular writing system (script) to write the words of a particular language. In cases where there is no standard orthography, people use a spontaneous orthography that is based on different criteria. The main criterion is phonology: how to render a word's pronunciation in the given writing system. This mainly depends on language-specific assumptions about the grapheme-to-phoneme mapping. Another criterion is to use cognates in a related language (a similar language or a language variant), where two words are cognates if they are related etymologically and have the same meaning. Additionally, a spontaneous orthography may be affected by speech effects, which are the lengthening of specific syllables to show emphasis or other effects (such as كتيييير ktyyyyr[2] ‘veeeery’).

EGY has no standard orthography. Instead, it has a spontaneous orthography that is related to the standard orthography of Modern Standard Arabic. Table 1 shows an example of writing a sentence in EGY spontaneous orthography in different variants.

[2] Arabic transliteration is presented in the Habash-Soudi-Buckwalter scheme (Habash et al., 2007): (in alphabetical order) AbtθjHxdðrzsšSDTĎςγfqklmnhwy and the additional symbols: ’ ء, Â أ, Ǎ إ, Ā آ, ŵ ؤ, ŷ ئ, ħ ة, ý ى.

2.2 Arabizi

Arabizi is a spontaneous orthography used to write DA using the Latin script, the so-called Arabic numerals, and other symbols commonly found on various input devices, such as punctuation. Arabizi is commonly used by Arabic speakers to write in social media and in SMS and chat applications.

The orthography decisions made for writing in Arabizi mainly depend on a phoneme-to-grapheme mapping between the Arabic pronunciation and the Latin script. This is largely based on the phoneme-to-grapheme mapping used in English. Crucially, Arabizi is not a simple transliteration of Arabic, under which each Arabic letter in some orthography is replaced by a Latin letter (as is the case in the Buckwalter transliteration, used widely in natural language processing but nowhere else). As a result, it is not straightforward to convert Arabizi to Arabic. We discuss some specific aspects of Arabizi.

Vowels. While EGY orthography omits vocalic diacritics representing short vowels, Arabizi uses the Latin script symbols for vowels (a, e, i, o, u, y) to represent EGY’s short and long vowels, making them ambiguous. In some cases, Arabizi words omit short vowels altogether, as is done in Arabic orthography.

Consonants. Another source of ambiguity is the use of a single Latin letter to refer to multiple Arabic phonemes. For example, the Latin letter "d" is used to represent the sounds of the Arabic letters د d and ض D. Additionally, some pairs of Arabizi letters can ambiguously map to a single Arabic letter or to a pair of letters: "sh" can be used to represent ش š or سه sh. Arabizi also uses digits to represent some Arabic letters. For example, the digits 2, 3, 5, 6, 7 and 9 are used to represent the Hamza (glottal stop), and the sounds of the letters ع ς, خ x, ط T, ح H and ص S, respectively. However, when followed by "’", the digits 3, 6, 7 and 9 change their interpretations to the dotted version of the Arabic letter: غ γ, ظ Ď, خ x and ض D, respectively. Moreover, "’" (as well as "q") may also refer to the glottal stop.

Foreign Words. Arabizi contains a large number of foreign words, which are either borrowings such as mobile or instances of code switching such as I love you.

Abbreviations. Arabizi may also include some abbreviations such as isa, which means إن شاء الله Ǎn šA’ Allh ‘God willing’.
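The digit conventions listed for Arabizi consonants can be written down as a small lookup. The sketch below is only an illustration of those conventions, not part of the 3ARRIB system: the names DIGIT_TO_HSB, DOTTED_TO_HSB and decode_digits are hypothetical, the output uses the Habash-Soudi-Buckwalter symbols rather than Arabic script, and the function rewrites digits only, leaving all Latin letters (and all the ambiguity they carry) untouched.

```python
# Digit conventions from the Arabizi description, in Habash-Soudi-Buckwalter
# symbols. All names here are illustrative, not taken from the paper.
DIGIT_TO_HSB = {"2": "'", "3": "ς", "5": "x", "6": "T", "7": "H", "9": "S"}

# A following apostrophe switches 3, 6, 7 and 9 to the dotted letter.
DOTTED_TO_HSB = {"3'": "γ", "6'": "Ď", "7'": "x", "9'": "D"}


def decode_digits(word: str) -> str:
    """Replace Arabizi digits (and digit + apostrophe pairs) by HSB symbols."""
    word = word.replace("’", "'")  # treat the curly apostrophe like the ASCII one
    out = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in DOTTED_TO_HSB:        # e.g. "3'" is the dotted variant
            out.append(DOTTED_TO_HSB[pair])
            i += 2
        elif word[i] in DIGIT_TO_HSB:    # plain digit
            out.append(DIGIT_TO_HSB[word[i]])
            i += 1
        else:                            # Latin letters are left as they are
            out.append(word[i])
            i += 1
    return "".join(out)
```

For instance, decode_digits("embare7") rewrites only the final digit and yields "embareH". A deterministic table like this cannot resolve the vowel and consonant ambiguities described above, which is why the paper turns to a trained transducer.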

Orthography              Example
CODA                     mA šftš SHAby mn AmbArH (ما شفتش صحابي من امبارح)
Non-CODA Arabic script   mAšwftš SwHAbý mn AmbArH (ماشوفتش صوحابى من امبارح)
                         mšftš SHAbý mn ǍnbArH (مشفتش صحابى من إنبارح)
                         mA šftyš SHAby mn ǍmbArH (ما شفتيش صحابي من إمبارح)
Arabizi                  mashoftesh sohaby men embare7
                         ma shftesh swhabi mn imbareh
                         mshwftish swhaby min ambare7

Table 1: The different spelling variants in EGY and Arabizi for writing the sentence "I have not seen my friends since yesterday" versus its corresponding CODA form.

2.3 CODA

CODA is a conventionalized orthography for Dialectal Arabic (Habash et al., 2012a). In CODA, every word has a single orthographic representation. CODA has five key properties (Eskander et al., 2013). First, CODA is an internally consistent and coherent convention for writing DA. Second, CODA is primarily created for computational purposes, but is easy to learn and recognize by educated Arabic speakers. Third, CODA uses the Arabic script as used for MSA, with no extra symbols from, for example, Persian or Urdu. Fourth, CODA is intended as a unified framework for writing all dialects. CODA has been defined for EGY (Habash et al., 2012a) as well as Tunisian Arabic (Zribi et al., 2014). Finally, CODA aims to maintain a level of dialectal uniqueness while using conventions based on similarities between MSA and the dialects. For a full presentation of CODA and a justification and explanation of its choices, see (Habash et al., 2012a).

CODA has been used as part of different approaches for modeling DA morphology (Habash et al., 2012b), tagging (Habash et al., 2013; Pasha et al., 2014) and spelling correction (Eskander et al., 2013; Farra et al., 2014). Converting Dialectal Arabic (written using a spontaneous Arabic orthography or Arabizi) to CODA is beneficial to NLP applications that perform better on standardized data with less sparsity (Eskander et al., 2013). Table 1 shows the CODA form corresponding to spontaneously written Arabic.

3 Related Work

Our proposed work has some similarities to Darwish (2013). His work is divided into two sections: language identification and transliteration. He used word and sequence-level features to identify Arabizi that is mixed with English. For Arabic words, he modeled transliteration from Arabizi to Arabic script, and then applied language modeling on the transliterated text. This is similar to our proposed work in terms of transliteration and language modeling. However, Darwish (2013) does not target a conventionalized orthography, while our system targets CODA. Additionally, Darwish (2013) transliterates Arabic words only after filtering out non-Arabic words, while we transliterate the whole input Arabizi. Finally, he does not use any morphological information, while we introduce the use of a morphological analyzer to support the transliteration pipeline.

Chalabi and Gerges (2012) presented a hybrid approach for Arabizi transliteration. Their work relies on the use of character transformation rules that are either handcrafted by a linguist or automatically generated from training data. They also employ word-based and character-based language models for the final transliteration choice. Like Darwish (2013), the work done by Chalabi and Gerges (2012) is similar to ours except that it does not target a conventionalized orthography, and does not use deep morphological information, while our system does.

There are three commercial products that convert Arabizi to Arabic, namely: Microsoft Maren, Google Ta3reeb[4] and Yamli.[5] However, since these products are for commercial purposes, there is not enough information about their approaches. But given their output, it is clear that they do not follow a well-defined standardized orthography like we do. Furthermore, these tools are primarily intended as input method support, not full text transliteration. As a result, their users’ goal is to produce Arabic script text, not Arabizi text. We expect, for instance, that users of these input method support systems will use less or no code switching to English, and they may employ character sequences that help them arrive at the target Arabic script form, which otherwise they would not write if they were targeting Arabizi.

Eskander et al. (2013) introduced a system to convert spontaneous EGY to CODA, called CODAFY. The difference between CODAFY and our proposed system is that CODAFY works on spontaneous text written in Arabic script, while our system works on Arabizi, which involves a higher degree of ambiguity. However, we use CODAFY as a black-box module in our preprocessing.

Additionally, there is some work on converting from dialectal Arabic to MSA, which is similar to our work in terms of processing a dialectal input. However, our final output is in EGY and not MSA. Shaalan et al. (2007) introduced a rule-based approach to convert EGY to MSA. Al-Gaphari and Al-Yadoumi (2010) also used a rule-based method to transform from Sanaani dialect to MSA. Sawaf (2010), Salloum and Habash (2011) and Salloum and Habash (2013) used morphological analysis and morphosyntactic transformation rules for processing EGY and Levantine Arabic.

There has been some work on machine transliteration by Knight and Graehl (1997). Al-Onaizan and Knight (2002) introduced an approach for machine transliteration of Arabic names. Freeman et al. (2006) also introduced a system for name matching between English and Arabic, which Habash (2008) employed as part of generating English transliterations from Arabic words in the context of machine translation. This work is similar to ours in terms of text transliteration. However, our work is not restricted to names.

4 Approach

4.1 Defining the Task

Our task is as follows: for each Arabizi word in the input, we choose the Arabic script word which is the correct CODA spelling of the input word and which carries the intended meaning (as determined in the context of the entire available text). We do not merge two or more input words into a single Arabic script word. If CODA requires two consecutive input Arabizi words to be merged, we indicate this by attaching a plus to the end of the first word. On the other hand, if CODA requires an input Arabizi word to be broken into two or more Arabic script words, we indicate this by inserting a dash between the words. We do this to maintain the bijection between input and output words, i.e., to allow easy tracing of the Arabic script back to the Arabizi input.

4.2 Transliteration Pipeline

The proposed system in this paper is called 3ARRIB.[6] Using the context of an input Arabizi word, 3ARRIB produces the word’s best Arabic script CODA transliteration. Figure 1 illustrates the different components of 3ARRIB in both the training and processing phases. We summarize the full transliteration process as follows. Each Arabizi sentence input to 3ARRIB goes through a preprocessing step of lowercasing (de-capitalization), speech effects handling, and punctuation splitting. 3ARRIB then generates a list of all possible transliterations for each word in the input sentence using a finite-state transducer that is trained on character-level alignment from Arabizi to Arabic script. We then experiment with different combinations of the following two components:

Morphological Analyzer. We use CALIMA (Habash et al., 2012b), a morphological analyzer for EGY. For each input word, CALIMA provides all possible morphological analyses, including the CODA spelling for each analysis. All generated candidates are passed through CALIMA. If CALIMA has no analysis for a candidate, then that candidate gets filtered out; otherwise, the CODA spellings of the analyses from CALIMA become the new candidates in the rest of the transliteration pipeline. For some words, CALIMA may suggest multiple CODA spellings that reflect different analyses of the word.

[4] [...]/ta3reeb
[5] http://www.yamli.com/
[6] 3ARRIB (pronounced /ar-rib/) means “Arabize!”.
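The filtering step performed with the morphological analyzer can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: filter_candidates, toy_analyze and TOY_LEXICON are hypothetical names, the analyzer is reduced to a function that returns the CODA spellings of a candidate (or an empty list), and the single lexicon entry reuses the Aly example from the paper's Morphological Analyzer section. The fallback of keeping all candidates when none is recognized is the rule that section states; removing duplicate spellings is a choice made here for brevity.

```python
from typing import Callable, List


def filter_candidates(candidates: List[str],
                      analyze: Callable[[str], List[str]]) -> List[str]:
    """Keep the CODA spellings of every candidate the analyzer recognizes.

    `analyze` stands in for the morphological analyzer: it returns the CODA
    spellings of all analyses of a candidate, or [] if it has no analysis.
    """
    kept: List[str] = []
    for cand in candidates:
        for coda in analyze(cand):
            if coda not in kept:      # drop duplicate spellings (our choice)
                kept.append(coda)
    # If no candidate is recognized, retain them all so that every input
    # word still produces some output.
    return kept if kept else list(candidates)


# Toy stand-in for the analyzer, in Habash-Soudi-Buckwalter transliteration.
TOY_LEXICON = {"Aly": ["Ǎlý", "Ǎlý", "Āly"]}


def toy_analyze(candidate: str) -> List[str]:
    return TOY_LEXICON.get(candidate, [])
```

Calling filter_candidates(["Aly", "Alyy"], toy_analyze) drops the unrecognized candidate and keeps the two distinct CODA spellings, while filter_candidates(["Alyy"], toy_analyze) falls back to the unfiltered list.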

Language Model. We disambiguate among the possibilities for all input words (which constitute a “sausage” lattice) using an n-gram language model.

Figure 1: An illustration of the different components of the 3ARRIB system in both the training and processing phases. FST: finite-state Transducer; LM: Language Model; CALIMA: Morphological Analyzer for Dialectal Arabic; MADAMIRA: Morphological Tagger for Arabic.

4.3 Preprocessing

We apply the following preprocessing steps to the input Arabizi text:

• We separate all attached emoticons (such as :D, :p, etc.) and punctuation from the words. We only keep the apostrophe because it is used in Arabizi to distinguish between different sounds. 3ARRIB keeps track of any word offset change, so that it can reconstruct the same number of tokens at the end of the pipeline.
• We tag emoticons and punctuation to protect them from any change through the pipeline.
• We lowercase all letters.
• We handle speech effects by replacing any sequence of the same letter whose length is greater than two by a sequence of exactly length two; for example, iiiii becomes ii.

4.4 Character-Based Transduction

We use a parallel corpus of Arabizi-Arabic words to learn a character-based transduction model. The parallel data consists of two sources. First, we use 2,200 Arabizi-to-Arabic script pairs from the training data used by (Darwish, 2013). We manually revised the Arabic side to be CODA-compliant. Second, we use about 6,300 pairs of proper names in Arabic and English from the Buckwalter Arabic Morphological Analyzer (Buckwalter, 2004). Since proper names are typically transliterated, we expect them to be a rich source for learning transliteration mappings.

The words in the parallel data are turned into space-separated character tokens, which we align using Giza (Och and Ney, 2003). We then use the phrase extraction utility in the Moses statistical machine translation system (Koehn et al., 2007) to extract a phrase table which operates over characters. The phrase table is then used to build a finite-state transducer (FST) that maps sequences of Arabizi characters into sequences of Arabic script characters. We use the negative logarithmic conditional probabilities of the Arabizi-to-Arabic pairs in the phrase tables as costs inside the FST. We use the FST to transduce an input Arabizi word to one or more words in Arabic script, where every resulting word in Arabic script is given a probabilistic score.

As part of the preprocessing of the parallel data, we associate all Arabizi letters with their word location information (beginning, middle and ending letters). This is necessary since some Arabizi mapping phenomena happen only at specific locations. For example, the Arabizi letter "o" is likely to be transliterated into أ Â in Arabic if it appears at the beginning of the word, but almost never so if it appears in the middle of the word.

For some special Arabizi cases, we directly transliterate input words to their correct Arabic form using a table, without going through the FST. For example, isa is mapped to إن شاء الله Ǎn šA’ Allh ‘God willing’. There are currently 32 entries in this table.

4.5 Morphological Analyzer

For every word in the Arabizi input, all the candidates generated by the character-based transduction are passed through the CALIMA morphological analyzer. For every candidate, CALIMA produces a list of all the possible morphological analyses. The CODA for these analyses need not be the same. For example, if the output from the character-based transducer is Aly, then CALIMA produces the following CODA-compliant spellings: إلى Ǎlý ‘to’, إلى Ǎlý ‘to me’ and آلي Āly ‘automatic’ or ‘my family’. All of these CODA spellings are the output of CALIMA for that particular input word. The output from CALIMA then becomes the set of final candidates of the input Arabizi in the rest of the transliteration pipeline. If a word is not recognized by CALIMA, it gets filtered out from the transliteration pipeline. However, if all the candidates of some word are not recognized by CALIMA, then we retain them all since there should be an output for every input word.

We additionally run a tokenization step that makes use of the generated CALIMA morphological analysis. The tokenization scheme we target is D3, which separates all clitics associated with the word (Habash, 2010). For every word, we keep a list of the possible tokenized and untokenized CODA-compliant pairs. We use the tokenized or untokenized forms as inputs to either a tokenized or untokenized language model, respectively, as described in the next subsection. The untokenized form is necessary to retain the surface form at the end of the transliteration process.

Standalone clitics are sometimes found in Arabizi, such as lel ragel (which corresponds to لل راجل ll rAjl ‘for the man’). Since CALIMA does not handle most standalone clitics, we keep a lookup table that associates them with their tokenization information.

4.6 Language Model

We then use an EGY language model that is trained on CODA-compliant text. We investigate two options: a language model that has standard CODA white-space word tokenization conventions (“untokenized”), and a language model that has a D3 tokenized form of CODA in which all clitics are separated (“tokenized”). The output of the morphological analyzer (which is the input to the LM component) is processed to match the tokenization used in the LM.

The language models are built from a large corpus of 392M EGY words.[7] The corpus is first processed using CODAFY (Eskander et al., 2013), a system for spontaneous text conventionalization into CODA. This is necessary so that our system remains CODA-compliant across the whole transliteration pipeline. Eskander et al. (2013) state that the best conventionalization results are obtained by running the MLE component of CODAFY followed by an EGY morphological tagger, MADA-ARZ (Habash et al., 2013). In the work reported here, we use the newer version of MADA-ARZ, named MADAMIRA (Pasha et al., 2014). For the tokenized language model, we run a D3 tokenization step on top of the text processed by MADAMIRA. The processed data is used to build a language model with Kneser-Ney smoothing using the SRILM toolkit (Stolcke, 2002).

We use A* search to pick the best transliteration for each word given its context. The probability of any path in the A* search space combines the FST probability of the words with the probability from the language model. Thus, for any path of selected Arabic possibilities $A_{0,i} = \{a_0, a_1, \dots, a_i\}$ given the corresponding input Arabizi sequence $W_{0,i} = \{w_0, w_1, \dots, w_i\}$, the transliteration probability can be defined by equation (1):

$$P(A_{0,i} \mid W_{0,i}) = \prod_{j=0}^{i} \Big( P(a_j \mid w_j) \times P(a_j \mid a_{j-N+1,\,j-1}) \Big) \qquad (1)$$

where $N$ is the maximum affordable n-gram length in the LM, $P(a_j \mid w_j)$ is the FST probability of transliterating the Arabizi word $w_j$ into the Arabic word $a_j$, and $P(a_j \mid a_{j-N+1,\,j-1})$ is the LM probability of the sequence $\{a_{j-N+1}, a_{j-N+2}, \dots, a_j\}$.

[7] All of the resources we use are available from the Linguistic Data Consortium: www.ldc.upenn.edu.

5 Experiments and Results

5.1 Data

We use two in-house data sets for development (Dev; 502 words) and blind testing (Test; 1004 words). The data contains EGY Arabizi SMS conversations that are mapped to Arabic script in CODA by a CODA-trained EGY native speaker.

5.2 Experiments

We conducted a suite of experiments to evaluate the performance of our approach and identify optimal settings on the Dev set. The optimal result and the baseline are then applied to the blind Test set. During development, the following settings were explored:

• INV-Selection: The training data of the finite-state transducer is used to generate the list of possibilities for each input Arabizi word. If the input word cannot be found in the FST training data, the word is kept in Arabizi.
• FST-ONLY: Pick the top choice from the list generated by the finite-state transducer.
• FST-CALIMA: Pick the top choice from the list after the CALIMA filtering.
• FST-CALIMA-Tokenized-LM-5: Run the full pipeline of 3ARRIB with a 5-gram tokenized LM.[8]
• FST-CALIMA-Tokenized-LM-5-MLE: The same as FST-CALIMA-Tokenized-LM-5, but for an Arabizi word that appears in training, force its most frequently seen mapping directly instead of running the transliteration pipeline for that word.
• FST-CALIMA-Untokenized-LM-5: Run the full pipeline of 3ARRIB with a 5-gram untokenized LM.
• FST-Untokenized-LM-5: Run the full pipeline of 3ARRIB minus the CALIMA filtering with a 5-gram untokenized LM. This setup is analogous to the transliteration approach proposed by (Darwish, 2013). Thus we use it as our baseline.

Each of the above experiments is evaluated with exact match, and with Alif/Ya normalization (El Kholy and Habash, 2010; Habash, 2010).

[8] 3, 5, and 7-gram LMs have been tested. The 3 and 5-gram LMs give the same performance while the 7-gram LM is the worst.

5.3 Results

Table 2 summarizes the results on the Dev set. Our best performing setup is FST-CALIMA-Tokenized-LM-5, which has 77.5% accuracy and 79.1% accuracy with normalization. The baseline system, FST-Untokenized-LM-5, gives 74.1% accuracy and 74.9% accuracy with normalization. This highlights the value of morphological filtering as well as sparsity-reducing tokenization.

Table 3 shows how we do (best system and best baseline) on a blind Test set. Although the accuracy drops overall, the gap between the best system and the baseline increases.

System                            A/Y-normalization
INV-Selection                     40.6
FST-ONLY (pick top choice)        65.1
FST-CALIMA (pick top choice)      68.9
FST-CALIMA-Tokenized-LM-5         79.1
FST-CALIMA-Tokenized-LM-5-MLE     73.5
FST-CALIMA-Untokenized-LM-5       78.9
FST-Untokenized-LM-5              74.9

Table 2: Results on the Dev set in terms of accuracy (%).

Table 3: Results on the blind Test set in terms of accuracy (%).

5.4 Error Analysis

We conducted two error analyses for the best performing transliteration setting on the Dev set. We first analyze in which component the Dev set errors occur. About 29% of the errors are cases where the FST does not generate the correct answer. An additional 15% of the errors happen because the correct answer is not covered by CALIMA. The language model does not include the correct answer in an additional 8% of the errors. The rest of the errors (48%) are cases where the correct answer is available in all components but does not get selected.

Motivated by the value of Arabizi transliteration for machine translation into English, we distinguish between two types of words: words that remain the same when translated into English, such as English words, proper nouns, laughs, emoticons, punctuation and digits (EN-SET), versus EGY-only words (EGY-SET). Examples of words in EN-SET are: love you very much (code switching), Peter (proper noun), haha (laugh), :D (emoticon), ! (punctuation) and 123 (digits).

While the overall performance of our best settings is 77.5%, the accuracy of the EGY-SET by itself is 84.6% as opposed to 46.2% for EN-SET. This large difference reflects the fact that we do not target English word transliteration into Arabic script explicitly.

We now perform a second error analysis only on the errors in the EGY-SET, in which we categorize the errors by their linguistic type. About 25% of the errors are non-CODA-compliant system output, where the answer is a plausible non-CODA form, i.e., a form that may be written or read easily by a native speaker who is not aware of CODA. For example, the system generates the non-CODA form مينفعش mynfςš instead of the correct CODA form ما ينفعش mA ynfςš ‘it doesn’t work’. Ignoring the CODA-related errors increases the overall accuracy by about 3.0 points, to 80.5%. The accuracy of the EGY-SET rises to 88.3%, as opposed to 84.6% when considering CODA compliance.

Ambiguous Arabizi input contributes to an additional 27% of the errors, where the system assigns a plausible answer that is incorrect in context. For example, the word matar in the input Arabizi fel matar ‘at the airport’ has two plausible out-of-context solutions: مطار mTAr ‘airport’ (contextually correct) and مطر mTr ‘rain’ (contextually incorrect).

In about 2% of the errors, the Arabizi input contains a typo, making it impossible to produce the gold reference. For example, the input Arabizi ba7bet contains a typo where the final t should turn into k, so that it means باحبك bAHbk ‘I love you [2fs]’.

In the rest [...]

6 Conclusion and Future Work

We presented a method for converting dialectal Arabic (specifically, EGY) written in Arabizi to Arabic script following the CODA convention for DA orthography. We achieve a 17% relative error reduction (accuracy rises from 63.1% to 69.4%) over our implementation of a previously published work (Darwish, 2013) on a blind test set.

In the future, we plan to improve several aspects of our models, particularly FST character mapping, the morphological analyzer coverage, and language models. We also plan to work on the problem of automatic identification of non-Arabic words. We will extend the system to work on other Arabic dialects. We also plan to make the 3ARRIB system publicly available.

Acknowledgement

This paper is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR0011-12-C-0014. Any opinions, findings and conclusions or recommendations expressed in this paper are those of the authors and do not necessarily reflect the views of DARPA.
