This report presents an exploratory analysis of the text data provided for the Coursera Data Science Capstone project. The goal of this milestone is to demonstrate familiarity with the datasets and outline a plan for building a text prediction algorithm and a Shiny application.
The datasets consist of text from three sources: blogs, news, and Twitter.
The data files were downloaded and successfully loaded into R for analysis. Each file contains English text collected from different real-world sources.
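A minimal sketch of the loading step, in base R. The file names below are assumptions (the report does not state them), and `read_corpus` is a hypothetical helper. Opening the connection in binary mode and passing `skipNul = TRUE` is one common way to avoid truncated reads on files that contain stray control characters.

```r
# Hypothetical helper: read one text file into a character vector, one element per line.
read_corpus <- function(path) {
  con <- file(path, open = "rb")   # binary mode avoids early end-of-file on odd bytes
  on.exit(close(con))
  readLines(con, encoding = "UTF-8", skipNul = TRUE)
}

# Assumed file locations; adjust to wherever the downloaded data was unpacked.
paths <- c(
  blogs   = "en_US.blogs.txt",
  news    = "en_US.news.txt",
  twitter = "en_US.twitter.txt"
)

# Load only the files that are actually present, keeping the source names.
corpora <- lapply(paths[file.exists(paths)], read_corpus)
```

`corpora` is then a named list with one character vector per source that was found on disk.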
The three datasets differ significantly in size and structure.
Key statistics explored include the number of lines, the number of words, and the distribution of word lengths.
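The three statistics named above could be computed roughly as follows. This is a sketch in base R, not the report's actual code: `summarize_text` is a hypothetical helper, words are approximated by splitting on whitespace, and the two sample lines are toy data.

```r
# Hypothetical helper: line count, word count, and word-length distribution.
summarize_text <- function(lines) {
  words <- unlist(strsplit(lines, "\\s+"))   # whitespace split as a rough word boundary
  words <- words[nzchar(words)]              # drop empty strings from leading spaces
  list(
    n_lines          = length(lines),
    n_words          = length(words),
    word_length_dist = table(nchar(words))   # counts of words by character length
  )
}

# Toy input standing in for one of the datasets.
sample_lines <- c("the quick brown fox", "hello world")
stats <- summarize_text(sample_lines)
```

Applying the same function to each source gives directly comparable summaries across blogs, news, and Twitter.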
Basic exploratory analysis was performed to understand the structure of the text data.
Plots of line length show that most lines are short, but there are a few very long entries, especially in the blogs and news datasets.
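One way to check this kind of skew numerically is to look at quantiles of line length. The sketch below uses base R; `line_length_summary` is a hypothetical helper and `toy` is made-up input, not data from the report.

```r
# Hypothetical helper: selected quantiles of line length in characters.
line_length_summary <- function(lines, probs = c(0.5, 0.9, 0.99, 1)) {
  lens <- nchar(lines)
  quantile(lens, probs = probs)
}

# Toy input: two short lines and one long one.
toy <- c("short", "also short", paste(rep("long", 50), collapse = " "))
q <- line_length_summary(toy)
```

A large gap between the median and the maximum is the numeric counterpart of a long right tail in a histogram.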
Some notable observations include:

- Twitter text is highly informal and short
- Blog text contains longer sentences and richer vocabulary
- News text is more formal and structured
This variation suggests that preprocessing steps such as cleaning, tokenization, and filtering will be important.
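As an illustration of what such preprocessing might look like, here is a small base R sketch. The specific rules (lowercasing, dropping URLs, keeping only letters and apostrophes) are assumptions chosen for the example, and `clean_text` and `tokenize` are hypothetical names.

```r
# Hypothetical cleaning step: normalize case and strip noise.
clean_text <- function(x) {
  x <- tolower(x)
  x <- gsub("https?://\\S+", " ", x)   # drop URLs
  x <- gsub("[^a-z' ]+", " ", x)       # keep letters, apostrophes, and spaces
  x <- gsub("\\s+", " ", x)            # collapse repeated whitespace
  trimws(x)
}

# Hypothetical tokenizer: one character vector of words per input line.
tokenize <- function(x) {
  tokens <- strsplit(clean_text(x), " ", fixed = TRUE)
  lapply(tokens, function(t) t[nzchar(t)])
}

cleaned <- clean_text("Check THIS out: http://example.com #wow!!")
tokens  <- tokenize("Don't stop")
```

Filtering steps such as removing very rare words or profanity would slot in after tokenization; they are left out here to keep the sketch short.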
The final prediction algorithm will be based on n-gram language models. The plan includes:

- Cleaning and preprocessing the text
- Tokenizing words and phrases
- Building unigram, bigram, and trigram models
- Selecting the most probable next word based on user input
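The plan above can be sketched in base R as follows. This is an illustrative toy, not the final algorithm: it counts n-grams up to trigrams and picks the most frequent continuation, falling back to shorter contexts when the longer one was never seen. That unsmoothed fallback is an assumption, since the report does not name a smoothing or backoff scheme. `count_ngrams`, `build_model`, and `predict_next` are hypothetical names, and the three training sentences are made up.

```r
# Count n-grams of order n across a list of token vectors.
count_ngrams <- function(token_list, n) {
  grams <- unlist(lapply(token_list, function(tok) {
    if (length(tok) < n) return(character(0))
    idx <- seq_len(length(tok) - n + 1)
    vapply(idx, function(i) paste(tok[i:(i + n - 1)], collapse = " "), character(1))
  }))
  sort(table(grams), decreasing = TRUE)
}

# model[[n]] holds the counts for n-grams of order n.
build_model <- function(token_list, max_n = 3) {
  lapply(seq_len(max_n), function(n) count_ngrams(token_list, n))
}

# Try the longest context first, then shorter ones, then the top unigram.
predict_next <- function(model, text) {
  words <- strsplit(tolower(trimws(text)), "\\s+")[[1]]
  words <- words[nzchar(words)]
  for (n in seq(length(model), 1)) {
    k <- n - 1                                  # context words needed for an n-gram
    if (k > length(words)) next
    if (k == 0) return(names(model[[1]])[1])    # no context matched: most frequent word
    prefix <- paste0(paste(tail(words, k), collapse = " "), " ")
    hits <- model[[n]][startsWith(names(model[[n]]), prefix)]
    if (length(hits) > 0) {
      best <- names(hits)[which.max(hits)]
      return(sub(".* ", "", best))              # keep only the final word
    }
  }
  NA_character_
}

# Toy training data (made up for illustration).
train  <- c("i love new york", "i love new york", "we love new ideas")
tokens <- strsplit(tolower(train), "\\s+")
model  <- build_model(tokens)

from_trigram <- predict_next(model, "they love new")
from_bigram  <- predict_next(model, "brand new")
```

On the full datasets, scanning n-gram names with `startsWith` would likely be too slow; a table keyed by context would be the natural next step.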
A Shiny web application will be developed to allow users to type text and receive word predictions in real time. The app will:

- Accept text input from the user
- Display predicted next words
- Use an efficient backend to ensure fast response time
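A rough sketch of how such an app might be wired together is shown below. It assumes the `shiny` package is installed and uses a tiny hard-coded `lookup` list as a stand-in for a precomputed prediction table; `lookup`, `suggest`, and the toy entries are all hypothetical. Precomputing suggestions so that the server only performs a lookup is one possible way to keep responses fast.

```r
# Hypothetical precomputed table: last word typed -> ranked suggestions (toy values).
lookup <- list(
  "thank" = c("you", "god", "goodness"),
  "new"   = c("york", "year", "one")
)

# Return up to n suggestions for the last word of the input.
suggest <- function(text, n = 3) {
  words <- strsplit(tolower(trimws(text)), "\\s+")[[1]]
  if (length(words) == 0) return(character(0))
  hits <- lookup[[words[length(words)]]]
  if (is.null(hits)) character(0) else head(hits, n)
}

# Build the app only if shiny is available; launch only in an interactive session.
if (requireNamespace("shiny", quietly = TRUE)) {
  ui <- shiny::fluidPage(
    shiny::textInput("text", "Type a phrase:"),
    shiny::verbatimTextOutput("pred")
  )
  server <- function(input, output, session) {
    # Re-evaluated automatically whenever input$text changes.
    output$pred <- shiny::renderText(paste(suggest(input$text), collapse = ", "))
  }
  if (interactive()) shiny::runApp(shiny::shinyApp(ui, server))
}
```

Keeping `suggest` as a plain function, separate from the UI code, means the prediction logic can be tested without starting the app.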
This milestone confirms that the data has been successfully loaded and explored. The findings from this exploratory analysis provide a strong foundation for developing the final prediction algorithm and Shiny application.