This report shows the initial exploratory analysis of the activity recognition dataset and outlines the plan for building a prediction algorithm and Shiny app.
We successfully downloaded and loaded the two required CSV files into R.
library(ggplot2)
library(knitr)
# Use the path on your own computer here
train <- read.csv("/Users/xuqiang/Downloads/b1/pml-training.csv")
test <- read.csv("/Users/xuqiang/Downloads/b1/pml-testing.csv")
# Basic summary table (number of rows and columns)
file_summary <- data.frame(
  Dataset = c("Training Set", "Testing Set"),
  Rows = c(nrow(train), nrow(test)),
  Columns = c(ncol(train), ncol(test))
)
kable(file_summary, caption = "Basic Size of the Datasets")
| Dataset | Rows | Columns |
|---|---|---|
| Training Set | 19622 | 160 |
| Testing Set | 20 | 160 |
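The loading step above uses absolute paths that exist on only one machine. As a minimal sketch of a more portable variant, the hypothetical helper `load_pml()` below (not part of the original analysis) builds the path from a single configurable directory and fails early if the file is missing. The demonstration writes a tiny stand-in file with invented values to a temporary directory so that it runs without the real CSV files.

```r
# Hypothetical helper: read one dataset from a configurable directory
load_pml <- function(data_dir, file_name) {
  path <- file.path(data_dir, file_name)
  if (!file.exists(path)) {
    stop("File not found: ", path)
  }
  read.csv(path, stringsAsFactors = FALSE)
}

# Demonstration on a tiny stand-in file (invented values, not the real data)
demo_dir <- tempdir()
demo <- data.frame(
  user_name = c("adelmo", "carlitos"),
  roll_belt = c(1.41, 1.42)
)
write.csv(demo, file.path(demo_dir, "pml-demo.csv"), row.names = FALSE)

loaded <- load_pml(demo_dir, "pml-demo.csv")
dim(loaded)
```

With the real files, the same helper would be called once for `pml-training.csv` and once for `pml-testing.csv`, with `data_dir` set to wherever they were downloaded.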
Here is a quick look at columns 2 to 6 of the training set, which hold metadata such as the user name, the timestamps, and the new_window flag.
summary(train[, 2:6])
## user_name raw_timestamp_part_1 raw_timestamp_part_2 cvtd_timestamp
## Length :19622 Min. :1.322e+09 Min. : 294 Length :19622
## N.unique : 6 1st Qu.:1.323e+09 1st Qu.:252912 N.unique : 20
## N.blank : 0 Median :1.323e+09 Median :496380 N.blank : 0
## Min.nchar: 5 Mean :1.323e+09 Mean :500656 Min.nchar: 16
## Max.nchar: 8 3rd Qu.:1.323e+09 3rd Qu.:751891 Max.nchar: 16
## Max. :1.323e+09 Max. :998801
## new_window
## Length :19622
## N.unique : 2
## N.blank : 0
## Min.nchar: 2
## Max.nchar: 3
##
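For the character columns, the summary above reports only how many unique values there are. A frequency table shows how the rows are spread across those values. The sketch below uses a small made-up data frame (`toy`, with invented values) as a stand-in for `train`, so it runs without the CSV files. The same `table()` calls should apply to `train$user_name` and `train$new_window`.

```r
# Toy stand-in for `train`; the values are invented for illustration only
toy <- data.frame(
  user_name  = c("adelmo", "adelmo", "pedro", "carlitos", "pedro", "pedro"),
  new_window = c("no", "no", "yes", "no", "no", "no"),
  stringsAsFactors = FALSE
)

# Rows per participant
user_counts <- table(toy$user_name)
print(user_counts)

# Cross-tabulate participant against the new_window flag
window_by_user <- table(toy$user_name, toy$new_window)
print(window_by_user)
```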
We plot a histogram of one core sensor feature (roll_belt) to see its distribution.
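# Histogram of roll_belt; backticks are dropped from the title because
# ggplot2 would print them literally
ggplot(train, aes(x = roll_belt)) +
  geom_histogram(bins = 30, fill = "steelblue", alpha = 0.7) +
  labs(
    title = "Distribution of the roll_belt Sensor Data",
    x = "Value",
    y = "Frequency"
  )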
The roll_belt variable shows a concentrated distribution, with most values between 0 and 100.
The planned prediction algorithm targets the classe variable.
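A statement about where most values of a feature fall can be checked numerically as well as visually. The sketch below defines a hypothetical helper, `share_in_range()`, that returns the fraction of non-missing values inside a closed interval. It is demonstrated on an invented vector, not on the real sensor data. Applied to the real data, the call would be `share_in_range(train$roll_belt, 0, 100)`.

```r
# Hypothetical helper: fraction of non-missing values inside [lower, upper]
share_in_range <- function(x, lower, upper) {
  x <- x[!is.na(x)]
  mean(x >= lower & x <= upper)
}

# Invented values standing in for train$roll_belt
toy_roll_belt <- c(-5, 1.4, 1.5, 20, 120, 130, 0, 100, 50, 123)
share <- share_in_range(toy_roll_belt, 0, 100)
share
```

A share well above 0.5 would support the reading of the histogram, while a lower share would suggest that a second cluster of values lies outside the interval.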