Coursera Data Science Capstone Milestone Report

Introduction

This report will perform exploratory analysis to understand statistical properties of the provided data sets that can later be used to build a prediction model for a final Shiny application. Here we will identify some major features of the training data and then summarize plans for the predictive model.

The model will be trained using a document corpus compiled from three sources of text data: Blogs, News, and Twitter.

This project will only focus on the English corpora (though four languages are available)

The motivation for this project is to: 1. Demonstrate that I’ve downloaded the data and have successfully loaded it in. 2. Create a basic report of summary statistics about the data sets. 3. Report any interesting findings that I’ve amassed so far. 4. Get feedback on my plans for creating a prediction algorithm and Shiny app. ## Grading Criteria: Does the link lead to an HTML page describing the exploratory analysis of the training data set?

Has the data scientist done basic summaries of the three files? Word counts, line counts and basic data tables?

Has the data scientist made basic plots, such as histograms to illustrate features of the data?

Was the report written in a brief, concise style, in a way that a non-data scientist manager could appreciate?

Load Data and Libraries

The data sets were quite large (and my laptop memory and internet not up to repeated downloads), so I pre-downloaded (adn un-zipped) the data from the provided internet link (“https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip”) into the file path “C:/Users/tobin/Documents/GitHub/Capstone/”

Below is the code for loading the required libraries

library(knitr)
library(ggplot2)
rm(list = ls(all.names = TRUE))

In Use a Function to Do Exploratory Data Analysis

Below is the code for a function that will do the exploratory data analysis on the downloaded files. It will provide:

Word counts Character counts Line counts Minimum, maximum, and mean words per line File sizes Word frequency histograms (Top 25 Words) Words per line histograms The display of each data table is hashed out due to the lack of value it provides here.

trainURL <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
trainDataFile <- "data/Coursera-SwiftKey.zip"
if (!file.exists('data')) {
    dir.create('data')
}
if (!file.exists("data/final/en_US")) {
    tempFile <- tempfile()
    download.file(trainURL, tempFile)
    unzip(tempFile, exdir = "data")
    unlink(tempFile)
}

# blogs
blogsFileName <- "data/final/en_US/en_US.blogs.txt"
con <- file(blogsFileName, open = "r")
blogs <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)

# news
newsFileName <- "data/final/en_US/en_US.news.txt"
con <- file(newsFileName, open = "r")
news <- readLines(con, encoding = "UTF-8", skipNul = TRUE)

perform_eda <- function(file_path) {
  # Load required packages
  if (!requireNamespace("ggplot2", quietly = TRUE)) install.packages("ggplot2")
  library(ggplot2)
  
  # Read data
  data <- readLines(file_path, warn = FALSE)
  
  # Basic statistics
  num_lines <- length(data)
  line_lengths <- nchar(data)
  avg_line_length <- mean(line_lengths)
  file_size_mb <- file.info(file_path)$size / (1024^2)
  
  # Print summary
  cat("----- EDA Report -----\n")
  cat("File path:", file_path, "\n")
  cat("Number of lines:", num_lines, "\n")
  cat("Average characters per line:", round(avg_line_length, 1), "\n")
  cat("File size (MB):", round(file_size_mb, 2), "\n\n")
  
  # Create histograms
  hist(line_lengths, 
       main = "Distribution of Line Lengths", 
       xlab = "Characters per Line",
       col = "skyblue")
  
  # Boxplot
  boxplot(line_lengths,
          main = "Line Length Distribution",
          ylab = "Characters")
}

# Check if file exists
file.exists("data/final/en_US/en_US.blogs.txt")  # Should return TRUE

## [1] TRUE

perform_eda("data/final/en_US/en_US.blogs.txt")

## ----- EDA Report -----
## File path: data/final/en_US/en_US.blogs.txt 
## Number of lines: 899288 
## Average characters per line: 230 
## File size (MB): 200.42

##Interessting Findings and Way Ahead The Word Histograms confirm general knowledge of the prevalence of articles, prepositions, pronouns and other less specific words vice more specific words (noun, Proper noun, verb, etc.).

The histogram shows a need for making all words the same capitalization so words like “the” and “The” are only counted in one category.

A general idea of how to turn this information into a word prediction algorithm will be to further clean the data:

All un-capitalized Remove all non-English words/characters Count bi- and tri-word combos, as single words will not help predict Find a way to run the algorithm fast, maybe by sampling the data set vice using the whole thing. In real life the algorithm could run in the back ground and just provide the bi-/tri-word predictors in a compact file. Otherwise the tie lag would make the app unusable or less accurate.

Coursera Data Science Capstone Milestone Report

Asmaa

2025-02-05

Introduction

Load Data and Libraries

In Use a Function to Do Exploratory Data Analysis