Introduction

This report presents an exploratory analysis of the text data provided for the Coursera Data Science Capstone project. The goal of this milestone is to demonstrate familiarity with the datasets and outline a plan for building a text prediction algorithm and a Shiny application.

The datasets consist of text from three sources:

- Blogs
- News
- Twitter


Data Loading

The data files were downloaded and successfully loaded into R for analysis. Each file contains English text collected from a different real-world source.
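As an illustration of the loading step, the following is a minimal sketch in base R, not necessarily the code used for this report. The helper name `read_corpus` and the file names `en_US.blogs.txt`, `en_US.news.txt`, and `en_US.twitter.txt` are assumptions; adjust them to wherever the data was unpacked.

```r
# Hypothetical helper: read a text file into a character vector, one element per line.
read_corpus <- function(path) {
  # Binary mode avoids early truncation when a file contains stray control characters.
  con <- file(path, open = "rb")
  on.exit(close(con))
  readLines(con, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
}

# Assumed file names; only files that actually exist are read.
paths <- c(
  blogs   = "en_US.blogs.txt",
  news    = "en_US.news.txt",
  twitter = "en_US.twitter.txt"
)
existing <- paths[file.exists(paths)]
corpora  <- lapply(existing, read_corpus)  # named list of character vectors
```

Reading each file once into a named list keeps the later summary and plotting steps uniform across the three sources.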


Basic Summary Statistics

The three datasets differ substantially in size and structure.

Key statistics explored include:

- Number of lines
- Number of words
- Distribution of word lengths
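The statistics listed above could be computed along the following lines. This is a sketch in base R on toy input; `summarise_corpus` is a hypothetical helper, and words are approximated by splitting on whitespace, which is an assumption rather than the report's stated method.

```r
# Summarise one corpus: line count, word count, and a table of word lengths.
summarise_corpus <- function(lines) {
  words <- unlist(strsplit(lines, "\\s+"))  # split on runs of whitespace
  words <- words[nzchar(words)]             # drop empty strings from leading whitespace
  list(
    n_lines      = length(lines),
    n_words      = length(words),
    word_lengths = table(nchar(words))      # how many words have each length
  )
}

# Toy input standing in for one of the real datasets.
sample_lines <- c("the quick brown fox", "hello world")
stats <- summarise_corpus(sample_lines)
stats$n_lines
stats$n_words
stats$word_lengths
```

Applying the same function to each dataset with `lapply` gives directly comparable figures for blogs, news, and Twitter.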


Exploratory Data Analysis

Basic exploratory analysis was performed to understand the structure of the text data.

Plots of line length show that most lines are short, with a few very long entries, especially in the blogs and news datasets.
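A line-length comparison of this kind can be sketched as follows. The toy data, the variable names, and the choice of a log10 scale are assumptions made for illustration; they do not reproduce the report's actual plots.

```r
# Toy stand-ins for the real datasets: mostly short lines plus one very long entry.
toy <- list(
  blogs   = c("short post", strrep("a long rambling entry ", 40)),
  twitter = c("lol", "so true", "good morning everyone")
)

# Characters per line for each source.
chars_per_line <- lapply(toy, nchar)

# Median versus maximum highlights the long right tail.
sapply(chars_per_line, function(x) c(median = median(x), max = max(x)))

# One histogram per source; the log10 scale keeps the rare long lines visible.
op <- par(mfrow = c(1, length(chars_per_line)))
for (src in names(chars_per_line)) {
  hist(log10(chars_per_line[[src]]),
       main = src, xlab = "log10(characters per line)")
}
par(op)  # restore the previous plotting layout
```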


Interesting Findings

Some notable observations include:

- Twitter text is highly informal and short
- Blog text contains longer sentences and richer vocabulary
- News text is more formal and structured

This variation suggests that preprocessing steps such as cleaning, tokenization, and filtering will be important.
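One possible shape for the cleaning and tokenization steps is sketched below in base R. The specific rules (lowercasing, dropping URLs, keeping only ASCII letters and apostrophes) and the names `clean_text` and `tokenize` are assumptions chosen for illustration, not rules stated in this report.

```r
# Normalise raw text so that informal and formal sources share one vocabulary.
clean_text <- function(x) {
  x <- tolower(x)
  x <- gsub("https?://\\S+", " ", x)  # drop URLs
  x <- gsub("[^a-z' ]", " ", x)       # keep ASCII letters, apostrophes, spaces
  x <- gsub("\\s+", " ", x)           # collapse repeated spaces
  trimws(x)
}

# Split each cleaned line into word tokens; returns a list with one vector per line.
tokenize <- function(x) strsplit(clean_text(x), " ", fixed = TRUE)

tokenize("Check THIS out: http://example.com I'm so happy!!! 123")
# [[1]] "check" "this" "out" "i'm" "so" "happy"
```

Restricting to ASCII letters is a simplification for English text; it also discards digits and emoji, which may or may not be desirable for Twitter data.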


Plan for Prediction Algorithm

The final prediction algorithm will be based on n-gram language models. The plan includes:

- Cleaning and preprocessing the text
- Tokenizing words and phrases
- Building unigram, bigram, and trigram models
- Selecting the most probable next word based on user input
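The plan above can be sketched on a toy corpus as follows. The helpers `ngram_counts` and `predict_next` are hypothetical, and the simple back-off from trigram to bigram to unigram is an assumption about how the three models might be combined; the report itself does not fix a strategy.

```r
# Count n-grams of order n in a token vector; returns a table keyed by the n-gram string.
ngram_counts <- function(tokens, n) {
  starts <- seq_len(max(0, length(tokens) - n + 1))
  grams <- vapply(
    starts,
    function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
    character(1)
  )
  table(grams)
}

# Pick the most frequent continuation, trying the longest context first.
predict_next <- function(history, models) {
  for (n in 3:1) {
    context <- tail(history, n - 1)
    if (length(context) < n - 1) next          # not enough history for this order
    counts <- models[[n]]
    if (n > 1) {
      prefix <- paste0(paste(context, collapse = " "), " ")
      counts <- counts[startsWith(as.character(names(counts)), prefix)]
    }
    if (length(counts) > 0) {
      best <- names(counts)[which.max(counts)]
      return(tail(strsplit(best, " ", fixed = TRUE)[[1]], 1))
    }
  }
  NA_character_                                # nothing to predict from
}

tokens <- strsplit(
  "i love data science and i love data analysis and i love r", " "
)[[1]]
models <- lapply(1:3, function(n) ngram_counts(tokens, n))  # unigram, bigram, trigram

predict_next(c("i", "love"), models)       # "data": the trigram "i love data" occurs twice
predict_next(c("unseen", "love"), models)  # "data": backs off to the bigram model
```

Among n-grams sharing the same context, the highest count corresponds to the highest maximum-likelihood conditional probability, so no explicit division is needed to choose the top word.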


Plan for Shiny Application

A Shiny web application will be developed to allow users to type text and receive word predictions in real time. The app will:

- Accept text input from the user
- Display predicted next words
- Use an efficient backend to ensure fast response time
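A skeleton of such an app might look like the sketch below. It assumes the `shiny` package is installed; `predict_words` is a hypothetical placeholder that returns dummy suggestions, and the input and output IDs are illustrative.

```r
library(shiny)

# Placeholder predictor (hypothetical): the real app would query the n-gram models,
# loaded once at startup rather than inside the server function, to keep responses fast.
predict_words <- function(text) {
  words <- strsplit(trimws(tolower(text)), "\\s+")[[1]]
  if (length(words) == 0) return(character(0))
  c("the", "and", "to")  # dummy suggestions, for illustration only
}

ui <- fluidPage(
  titlePanel("Next-word prediction"),
  textInput("phrase", "Type a phrase:"),
  verbatimTextOutput("suggestions")
)

server <- function(input, output, session) {
  # Re-evaluated automatically whenever the text input changes.
  output$suggestions <- renderText({
    paste(predict_words(input$phrase), collapse = "  ")
  })
}

# Launch only in an interactive session so the script can also be sourced in batch mode.
if (interactive()) shinyApp(ui, server)
```

Keeping the predictor as a plain function, separate from the UI and server definitions, makes it possible to test the prediction logic without starting the app.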


Conclusion

This milestone confirms that the data has been successfully loaded and explored. The findings from this exploratory analysis provide a strong foundation for developing the final prediction algorithm and Shiny application.