
About

To populate the Funetik Inggliš (FI) database with words and their phonetic representations, I used several sources, three with just the English spelling and one that also contained phonetic data.

I performed SQL intersections on the data to identify words that appeared in multiple sources, and treated those as the important words to include in this initial word database, which I hope will continue to grow.
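A minimal sketch of that kind of intersection, using Python's built-in sqlite3 module. The table names (cmu_words, dwyl_words, sil_words), the single word column, and the sample rows are all hypothetical; the real schema and database engine are not described here.

```python
import sqlite3

# Hypothetical schema: one table per source, each with a single "word" column.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
for table in ("cmu_words", "dwyl_words", "sil_words"):
    cur.execute(f"CREATE TABLE {table} (word TEXT PRIMARY KEY)")

# Made-up sample rows, for illustration only.
cur.executemany("INSERT INTO cmu_words VALUES (?)",
                [("competition",), ("fun",), ("war",)])
cur.executemany("INSERT INTO dwyl_words VALUES (?)",
                [("competition",), ("fun",), ("zyzzyva",)])
cur.executemany("INSERT INTO sil_words VALUES (?)",
                [("competition",), ("war",), ("fun",)])

# INTERSECT keeps only the rows that appear in every SELECT.
cur.execute("""
    SELECT word FROM cmu_words
    INTERSECT
    SELECT word FROM dwyl_words
    INTERSECT
    SELECT word FROM sil_words
    ORDER BY word
""")
shared_words = [row[0] for row in cur.fetchall()]
conn.close()

print(shared_words)
```

Requiring a word to appear in every source is the strictest reading of "multiple sources"; a looser version could intersect the sources pairwise and take the union of the results.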

I would like to expand this site to allow users to create accounts and submit their own pronunciations of words, linked to their demographic data.

There are likely still inaccurate phonetic representations in the database, either from the source material or from an over-reaching regular-expression-based SQL update statement that changed too many things. One example involved words like "competition", which is represented as "K AA2 M P AH0 T IH1 SH AH0 N" in the CMU list and would therefore be "kamputišun" in FI. I deemed it more accurate to write this as "kampitišin". The first schwa after "kamp" could fairly be either [ʌ] or [ɪ], I think, but the final vowel before 'n' should definitely be [ɪ], not [ʌ], as [ʌ] in this position sounds and feels awkward. Simply replacing -un with -in at the end of every word wouldn't work, though, as the word "fun" would become "fin". Therefore, a more nuanced approach and a lot of proofreading were required.
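One way to keep such a rule from over-reaching is to apply it to the CMU phoneme string, where the stress digit separates the two cases: "competition" ends in unstressed "AH0 N", while "fun" is "F AH1 N". The sketch below is in Python rather than SQL and is not the update statement actually used; the function name and the choice of "IH0" as the replacement are assumptions made for illustration.

```python
import re

# Unstressed schwa (AH0) immediately before a word-final N.
# Stressed AH1 or AH2, as in "fun" (F AH1 N), does not match.
FINAL_UNSTRESSED_UN = re.compile(r"\bAH0 N$")


def raise_final_schwa(arpabet: str) -> str:
    """Rewrite a word-final 'AH0 N' as 'IH0 N' (hypothetical helper)."""
    return FINAL_UNSTRESSED_UN.sub("IH0 N", arpabet)


print(raise_final_schwa("K AA2 M P AH0 T IH1 SH AH0 N"))
print(raise_final_schwa("F AH1 N"))
```

Even with the stress check, a rule like this only narrows the set of words to proofread; it does not decide cases such as the first schwa in "competition", which the pattern deliberately leaves alone.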

While that is a seemingly minor change, other changes were broader. For instance, in the interest of refining an elegant orthography, I elected not to include the open-mid back rounded vowel [ɔ], which is present in the speech of American English speakers without the cot–caught merger. The CMU authors evidently did pronounce words with this vowel, so I had to replace [ɔ] with 'o' or 'a'. The CMU authors used [ɔ] for both "almost" (AO1 L M OW2 S T) and "war" (W AO1 R).

While descriptive grammar is the only way to go (my idiolect isn't "better"), I had to choose one way to write each word for the purposes of this thought experiment. As long as it's consistent, it's better than the amalgam of historical accidents which is our current orthography.

Credits

The CMU Pronouncing Dictionary

Source of the pronunciation data, which I substantially reformatted to make it more readable, replacing the system of numbers with symbols for stress and syllable boundaries.
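As a rough illustration of the stress part of that reformatting, the sketch below swaps CMU's trailing stress digits for marks placed directly before the vowel. The choice of ˈ and ˌ as the symbols and the function name are assumptions, not the site's actual notation, and syllable boundaries are not handled at all.

```python
import re

# Assumed symbols: primary stress, secondary stress, and nothing for unstressed.
STRESS_MARKS = {"1": "ˈ", "2": "ˌ", "0": ""}

# A CMU phoneme: uppercase letters, optionally followed by a stress digit.
PHONEME = re.compile(r"^([A-Z]+)([012])?$")


def replace_stress_digits(arpabet: str) -> str:
    """Turn 'W AO1 R' into 'W ˈAO R' (hypothetical helper)."""
    out = []
    for token in arpabet.split():
        match = PHONEME.match(token)
        if match is None:
            raise ValueError(f"unexpected token: {token!r}")
        symbol, digit = match.groups()
        # Consonants carry no digit, so digit is None and no mark is added.
        out.append(STRESS_MARKS.get(digit, "") + symbol)
    return " ".join(out)


print(replace_stress_digits("AO1 L M OW2 S T"))
print(replace_stress_digits("W AO1 R"))
```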

Most of the phonetic transliterations for Phonetic English were adapted from or at least compared against the CMU Pronouncing Dictionary. I only used about a third of their data, and made several broad changes affecting all or many of the words to make them consistent with the funeticization I thought could represent a general American accent. The changes I made were informed by my own intuitions and idiolect, as well as what I learned in my study of linguistics, but not based on a large sampling of speaker data.

License

CMUdict  --  Major Version: 0.07

Copyright (C) 1993-2015 Carnegie Mellon University. All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions
are met:

1. Redistributions of source code must retain the above copyright
    notice, this list of conditions and the following disclaimer.
    The contents of this file are deemed to be source code.

2. Redistributions in binary form must reproduce the above copyright
    notice, this list of conditions and the following disclaimer in
    the documentation and/or other materials provided with the
    distribution.

This work was supported in part by funding from the Defense Advanced
Research Projects Agency, the Office of Naval Research and the National
Science Foundation of the United States of America, and by member
companies of the Carnegie Mellon Sphinx Speech Consortium. We acknowledge
the contributions of many volunteers to the expansion and improvement of
this dictionary.

THIS SOFTWARE IS PROVIDED BY CARNEGIE MELLON UNIVERSITY ``AS IS'' AND
ANY EXPRESSED OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL CARNEGIE MELLON UNIVERSITY
NOR ITS EMPLOYEES BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

english-words from dwyl on github

GitHub user dwyl reformatted a list of words originally from Project Gutenberg.

Original Data Sources

http://www.gutenberg.org/etext/3201

Moby Word Lists by Grady Ward

Creator	Ward, Grady
Title	Moby Word Lists
Language	English
LoC Class	PS: Language and Literatures: American literature
EText-No.	3201
Release Date	2002-05-01
Copyright Status	Not copyrighted in the United States. If you live elsewhere check the laws of your country before downloading this ebook.

http://www.gutenberg.org/etext/3202

Moby Thesaurus List by Grady Ward

Creator	Ward, Grady
Title	Moby Thesaurus List
Language	English
LoC Class	PS: Language and Literatures: American literature
EText-No.	3202
Release Date	2002-05-01
Copyright Status	Not copyrighted in the United States. If you live elsewhere check the laws of your country before downloading this ebook.

wordsEn.txt from sil.org

A list of 109582 English words compiled and corrected in 1991 from lists obtained from the Interociter bulletin board. The original read.me file said that the list came from Public Brand Software.

This word list includes inflected forms, such as plural nouns and the -s, -ed and -ing forms of verbs. Thus the number of lexical stems represented in the list is considerably smaller than the total number of words.

SSA: Top Names Over the Last 100 Years

The 100 most popular given names for male and female babies born during 1917-2016 according to the Social Security Administration.

Part of Speech Database on GitHub

This is a great list of words with symbols for what part of speech they can be.

I still need to make sure that where homographs are not homophones (e.g. the verb and noun "record"), the part(s)-of-speech entry is split appropriately.

Also, not all words are currently tagged with part(s)-of-speech.

Part Of Speech Database
July 23, 2000

Compiled by Kevin Atkinson <kevina@users.sourceforge.net>

The part-of-speech.txt file contains is a combination of 
"Moby (tm) Part-of-Speech II" and the WordNet database.

                          COPYRIGHT

The Moby database was explicitly pleased in the public domain:

    The Moby lexicon project is complete and has
    been place into the public domain. Use, sell,
    rework, excerpt and use in any way on any platform.
    
    Placing this material on internal or public servers is
    also encouraged. The compiler is not aware of any
    export restrictions so freely distribute world-wide.
    
    You can verify the public domain status by contacting
    
    Grady Ward
    3449 Martha Ct.
    Arcata, CA  95521-4884
    
    grady@netcom.com
    grady@northcoast.com


The WordNet database is under the following Copyright:

    This software and database is being provided to you, the LICENSEE, by
    Princeton University under the following license.  By obtaining, using  
    and/or copying this software and database, you agree that you have  
    read, understood, and will comply with these terms and conditions.:  
    
    Permission to use, copy, modify and distribute this software and
    database and its documentation for any purpose and without fee or
    royalty is hereby granted, provided that you agree to comply with  
    the following copyright notice and statements, including the disclaimer,  
    and that the same appear on ALL copies of the software, database and  
    documentation, including modifications that you make for internal  
    use or for distribution.  
    
    WordNet 1.6 Copyright 1997 by Princeton University.  All rights reserved.  
    
    THIS SOFTWARE AND DATABASE IS PROVIDED "AS IS" AND PRINCETON  
    UNIVERSITY MAKES NO REPRESENTATIONS OR WARRANTIES, EXPRESS OR  
    IMPLIED.  BY WAY OF EXAMPLE, BUT NOT LIMITATION, PRINCETON  
    UNIVERSITY MAKES NO REPRESENTATIONS OR WARRANTIES OF MERCHANT-  
    ABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE OR THAT THE USE  
    OF THE LICENSED SOFTWARE, DATABASE OR DOCUMENTATION WILL NOT  
    INFRINGE ANY THIRD PARTY PATENTS, COPYRIGHTS, TRADEMARKS OR  
    OTHER RIGHTS.
    
    The name of Princeton University or Princeton may not be used in  
    advertising or publicity pertaining to distribution of the software  
    and/or database.  Title to copyright in this software, database and  
    any associated documentation shall at all times remain with  
    Princeton University and LICENSEE agrees to preserve same.  

I assign no additional copyright to the combined database and the
software is explicitly being pleased in the public domain. However, I
would appreciate credit for my work.

NLTK Distance Metrics

I based my implementation for calculating Levenshtein distance on the Python implementation in the Natural Language Toolkit, part of the nltk.metrics.distance module.
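For readers unfamiliar with the metric, here is a minimal, dependency-free sketch of Levenshtein distance using the standard dynamic-programming recurrence. It is neither NLTK's code nor this site's implementation; NLTK's own version is available as edit_distance in the module named above.

```python
def levenshtein(source: str, target: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn source into target."""
    # previous[j] is the distance between the processed prefix of source
    # and the first j characters of target.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # turning i characters into "" takes i deletions
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # delete s_char
                current[j - 1] + 1,      # insert t_char
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


print(levenshtein("kamputišun", "kampitišin"))
```

Keeping only two rows of the table at a time holds memory proportional to the length of the target word, which is plenty for comparing individual spellings.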

Copyright (C) 2001-2017 NLTK Project
Author: Edward Loper <edloper@gmail.com>
        Steven Bird <stevenbird1@gmail.com>
        Tom Lippincott <tom@cs.columbia.edu>
URL: <http://nltk.org/>
For license information, see LICENSE.TXT

LICENSE.TXT

Copyright (C) 2001-2014 NLTK Project

Licensed under the Apache License, Version 2.0 (the 'License');
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

   http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an 'AS IS' BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.