Evaluating GPT-4 Modeled Arabic-English Code Switching With Python Word2Vec
I evaluated GPT-4's ability to emulate Arabic-English code-switching by comparing synthetic examples with natural data through Word2Vec models. My analysis highlights a gap in GPT-4's capacity to replicate nuanced sociolinguistic phenomena like code-switching. The project reinforces the need for localized datasets to enhance the accuracy of language models in all contexts.
Exploring Variation Across Four Languages Using Reality TV Captions and Legal Documents
I analyzed register variation in web data for English, Finnish, Greek, and Portuguese by comparing it to legal documents and reality TV subtitles. Using Jaccard and cosine similarity, I explored how web documents align with these two anchors. I used Python tools such as scikit-learn and pandas for data cleaning, sampling, and similarity analysis.
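As a rough illustration of the two similarity measures named above, here is a minimal standard-library sketch. The whitespace tokenizer and the function names are assumptions made for this example, not the project's actual pipeline, which used scikit-learn and pandas.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase + whitespace split stands in for real preprocessing.
    return text.lower().split()


def jaccard_similarity(doc_a, doc_b):
    # Overlap of the two vocabularies: |A ∩ B| / |A ∪ B|.
    set_a, set_b = set(tokenize(doc_a)), set(tokenize(doc_b))
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


def cosine_similarity(doc_a, doc_b):
    # Cosine of the angle between the two term-count vectors.
    counts_a, counts_b = Counter(tokenize(doc_a)), Counter(tokenize(doc_b))
    dot = sum(counts_a[term] * counts_b[term] for term in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Jaccard ignores how often a word occurs, while cosine similarity over counts is sensitive to frequency, so the two measures can rank the same web document differently against each anchor.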
Using Deep Learning and Embedding Spaces to Explore Gender Bias Online
Using large linguistic corpora from X (formerly Twitter), Reddit, and Wikipedia, I trained three separate 100-dimensional embedding spaces to capture the semantic meaning of words. I then used these spaces to explore gender bias with my own custom-designed metrics, which measure the distance between gendered words and sentiment-related words on each platform.
Comparing Pop and Rap Music Lyrics With Embedding Spaces
I trained two 100-dimensional embedding spaces using Genius Lyrics data for pop and rap music. I then used the Jaccard similarity of a variety of common words in each corpus to explore how words relate within semantic clusters across the two genres.
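One way such a comparison can be set up, sketched here under assumptions: take a word's nearest neighbors in each embedding space and compute the Jaccard similarity of the two neighbor sets. The neighbor-set approach, the function names, and the tiny 2-dimensional vectors below are all made up for illustration and are not taken from the project.

```python
import math


def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def nearest_neighbors(space, word, k):
    # Return the k words closest to `word` in one embedding space.
    target = space[word]
    scored = [(cosine(target, vec), other)
              for other, vec in space.items() if other != word]
    scored.sort(reverse=True)
    return {other for _, other in scored[:k]}


def neighbor_jaccard(space_a, space_b, word, k=2):
    # Jaccard similarity of the word's neighbor sets in the two spaces.
    neighbors_a = nearest_neighbors(space_a, word, k)
    neighbors_b = nearest_neighbors(space_b, word, k)
    union = neighbors_a | neighbors_b
    return len(neighbors_a & neighbors_b) / len(union) if union else 0.0


# Hypothetical toy spaces; real spaces would be 100-dimensional.
pop_space = {"love": (1.0, 0.0), "heart": (0.9, 0.1),
             "baby": (0.8, 0.3), "money": (0.0, 1.0)}
rap_space = {"love": (1.0, 0.0), "heart": (0.9, 0.2),
             "money": (0.8, 0.1), "baby": (0.0, 1.0)}

score = neighbor_jaccard(pop_space, rap_space, "love", k=2)
```

A score near 1 would mean the word keeps similar company in both genres, while a score near 0 would mean its neighborhood differs.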
Using Machine Learning KMeans Clustering Techniques To Categorize Wikipedia Articles
Using a large corpus of Wikipedia articles, I trained a KMeans clustering model to group the articles into several categories. I then explored these clusters and compared them to the categories that can be generated using the tag data in each Wikipedia article's metadata. This revealed several differences and similarities between how humans and machines categorize information.
Developing A Classifier For English Dialects
I trained a classifier on data from X (formerly Twitter) to classify English tweets by their country of origin. The result is a tool that can be applied to other corpora that lack country labels to estimate the country of origin of a text.
Crackdown - A Productivity App Startup
I am one of two full-stack developers creating Crackdown: A Productivity App from the ground up. As part of this project, I am developing for Android and iOS simultaneously in the Dart programming language using the Flutter framework. In addition, I am managing user creation, authentication, and databases using Firebase and Cloud Firestore.
JohnSpeaks.com © Last Updated November 2024