Exploratory Data Analysis & Project Plan

Project Overview

This report presents an initial exploratory analysis of the activity recognition dataset and outlines the plan for building a prediction algorithm and a Shiny app.

1. Data Loading & File Summary

We successfully downloaded and loaded the two required CSV files into R.

library(ggplot2)
library(knitr)

# Use the path on your own machine here
train <- read.csv("/Users/xuqiang/Downloads/b1/pml-training.csv")
test  <- read.csv("/Users/xuqiang/Downloads/b1/pml-testing.csv")

# Basic summary table (row and column counts)
file_summary <- data.frame(
  Dataset = c("Training Set", "Testing Set"),
  Rows    = c(nrow(train), nrow(test)),
  Columns = c(ncol(train), ncol(test))
)

kable(file_summary, caption = "Basic Size of the Datasets")

Basic Size of the Datasets
Dataset	Rows	Columns
Training Set	19622	160
Testing Set	20	160
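
The loading step above relies on an absolute path that exists on only one machine. A minimal sketch of a more portable loader follows. The names `read_activity_csv` and `data_dir` are hypothetical, and treating empty strings and the spreadsheet error marker `#DIV/0!` as missing values is an assumption about how the raw files encode missing data, not something this report verifies.

```r
# Hypothetical helper: read one CSV file from a configurable directory.
# Assumption: empty strings and "#DIV/0!" should count as missing values,
# in addition to the usual "NA".
read_activity_csv <- function(file_name, data_dir = ".") {
  path <- file.path(data_dir, file_name)
  if (!file.exists(path)) {
    stop("File not found: ", path)
  }
  read.csv(path, na.strings = c("NA", "", "#DIV/0!"), stringsAsFactors = FALSE)
}

# Usage (the "data" directory is a placeholder):
# train <- read_activity_csv("pml-training.csv", data_dir = "data")
# test  <- read_activity_csv("pml-testing.csv",  data_dir = "data")
```

Keeping the directory as an argument means the same script can run on another machine by changing one value.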

2. Basic Statistical Summary

Here is a quick look at columns 2 to 6 of the training set, which hold the user name, the timestamps, and the new_window flag rather than sensor readings.

summary(train[, 2:6])

##      user_name     raw_timestamp_part_1 raw_timestamp_part_2   cvtd_timestamp 
##  Length   :19622   Min.   :1.322e+09    Min.   :   294       Length   :19622  
##  N.unique :    6   1st Qu.:1.323e+09    1st Qu.:252912       N.unique :   20  
##  N.blank  :    0   Median :1.323e+09    Median :496380       N.blank  :    0  
##  Min.nchar:    5   Mean   :1.323e+09    Mean   :500656       Min.nchar:   16  
##  Max.nchar:    8   3rd Qu.:1.323e+09    3rd Qu.:751891       Max.nchar:   16  
##                    Max.   :1.323e+09    Max.   :998801                        
##      new_window   
##  Length   :19622  
##  N.unique :    2  
##  N.blank  :    0  
##  Min.nchar:    2  
##  Max.nchar:    3  
##

3. Data Visualization

We plot a histogram of one core sensor feature (roll_belt) to see its distribution.

ggplot(train, aes(x = roll_belt)) +
  geom_histogram(bins = 30, fill = "steelblue", alpha = 0.7) +
  labs(
    title = "Distribution of the `roll_belt` Sensor Data",
    x = "Value",
    y = "Frequency"
  )

4. Initial Findings

The training set has far more observations than the testing set (19,622 rows versus 20); both have 160 columns.
The roll_belt variable shows a concentrated distribution, with most values between 0 and 100.
Some columns contain missing values or summary statistics that are not useful for prediction; they will be cleaned in the next step.
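
The findings above can be checked numerically rather than read off the plot. The sketch below is one way to do that in base R; `na_fraction` and `share_in_range` are hypothetical helper names introduced here for illustration.

```r
# Fraction of missing values in each column, sorted from most to least missing.
na_fraction <- function(df) {
  frac <- vapply(df, function(col) mean(is.na(col)), numeric(1))
  sort(frac, decreasing = TRUE)
}

# Share of the non-missing values of x that fall inside [lower, upper].
share_in_range <- function(x, lower, upper) {
  x <- x[!is.na(x)]
  mean(x >= lower & x <= upper)
}

# Usage, assuming `train` has been loaded as in section 1:
# head(na_fraction(train), 10)             # the ten sparsest columns
# share_in_range(train$roll_belt, 0, 100)  # share of roll_belt in [0, 100]
```

Reporting these two numbers alongside the histogram would make the statements about roll_belt and about missing values directly verifiable.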

5. Plan for Prediction Algorithm & Shiny App

Prediction Algorithm

Clean the data by removing irrelevant columns and handling missing values.
Use cross-validation to train and test classification models (e.g., random forests or gradient boosting) to predict the classe variable.
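
The two steps above can be sketched in base R as follows. This is an outline under stated assumptions, not the final pipeline: the function names, the 0.5 missing-value threshold, and the choice of five folds are all illustrative, and the majority-class model is only a stand-in baseline so that the loop runs without extra packages.

```r
# Drop bookkeeping columns and columns that are mostly missing.
# `id_cols` and `max_na_frac` are illustrative choices, not settings from this report.
clean_features <- function(df, id_cols = character(0), max_na_frac = 0.5) {
  keep <- setdiff(names(df), id_cols)
  na_frac <- vapply(df[keep], function(col) mean(is.na(col)), numeric(1))
  df[, keep[na_frac <= max_na_frac], drop = FALSE]
}

# Assign each row to one of k folds, balanced within each outcome class.
make_folds <- function(y, k = 5, seed = 1) {
  set.seed(seed)
  folds <- integer(length(y))
  for (cls in unique(y)) {
    idx <- which(y == cls)
    folds[idx] <- sample(rep_len(seq_len(k), length(idx)))
  }
  folds
}

# Generic k-fold loop: fit on k - 1 folds, score accuracy on the held-out fold.
cv_accuracy <- function(df, outcome, fit_fn, predict_fn, k = 5, seed = 1) {
  folds <- make_folds(df[[outcome]], k, seed)
  vapply(seq_len(k), function(i) {
    model <- fit_fn(df[folds != i, , drop = FALSE])
    pred  <- predict_fn(model, df[folds == i, , drop = FALSE])
    mean(pred == df[[outcome]][folds == i])
  }, numeric(1))
}

# Stand-in model: always predict the most frequent class (a baseline only).
fit_majority <- function(d) names(which.max(table(d$classe)))
predict_majority <- function(model, d) rep(model, nrow(d))

# Usage, assuming `train` is loaded and `id_cols` lists the columns to drop:
# cleaned <- clean_features(train, id_cols = c("user_name", "cvtd_timestamp"))
# cv_accuracy(cleaned, "classe", fit_majority, predict_majority)
```

Because the model enters only through `fit_fn` and `predict_fn`, a random forest or gradient boosting fit can later replace the baseline without changing the fold logic, and the baseline accuracy gives a floor to compare against.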

Shiny App

Build a simple web interface where users can input sensor data.
Display the predicted activity class and basic confidence information in real time.
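
A rough sketch of how such an app might be wired together is shown below. It assumes the shiny package is installed; `build_app`, `format_prediction`, and `predict_fn` are hypothetical names, and only a single input (roll_belt) is included to keep the example short. The trained model is not part of the sketch: `predict_fn` stands for any function that takes a one-row data frame and returns a named vector of class probabilities.

```r
# Turn a named vector of class probabilities into a one-line message.
format_prediction <- function(probs) {
  best <- names(probs)[which.max(probs)]
  sprintf("Predicted class: %s (estimated probability %.0f%%)",
          best, 100 * max(probs))
}

# Build the app around a prediction function supplied by the caller.
# `predict_fn` is a hypothetical hook for the trained model.
build_app <- function(predict_fn) {
  ui <- shiny::fluidPage(
    shiny::titlePanel("Activity Class Prediction"),
    shiny::numericInput("roll_belt", "roll_belt", value = 0),
    shiny::textOutput("prediction")
  )
  server <- function(input, output, session) {
    # Re-runs whenever the input changes, so the text updates as the user types.
    output$prediction <- shiny::renderText({
      shiny::req(input$roll_belt)  # wait until the field holds a number
      format_prediction(predict_fn(data.frame(roll_belt = input$roll_belt)))
    })
  }
  shiny::shinyApp(ui, server)
}

# Usage, where `my_predict_fn` is a placeholder for the real model wrapper:
# shiny::runApp(build_app(my_predict_fn))
```

Passing the prediction function in as an argument keeps the interface code independent of whichever model is eventually chosen.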