This is Alternative Assessment 2 for the course WIE2003: Introduction to Data Science.
Question a
The table below shows the results of prediction for people doing screening test for coronavirus disease (COVID-19). Begin your markdown by interpreting the confusion matrix. No R code is required to answer this question.
| n = 195 | Tested Positive | Tested Negative | Total |
|---|---|---|---|
| Actual Positive | 120 | 15 | 135 |
| Actual Negative | 10 | 50 | 60 |
| Total | 130 | 65 | 195 |
Answer for Question a
A confusion matrix is one way to check how well a prediction model is performing.
The frequencies are as follows: True Positive (TP) = 120, True Negative (TN) = 50, False Positive (FP) = 10, False Negative (FN) = 15.
An FP is equivalent to a Type I error, while an FN is equivalent to a Type II error.
The model made 170 correct predictions (TP + TN). The model made 25 incorrect predictions (FP + FN). The error rate of this model is (25 / 195) * 100 = 12.82%. The overall accuracy rate of this model is (170 / 195) * 100 = 87.18%.
Note: 1. The error rate of the model is calculated as (FP + FN) / (TP + TN + FP + FN). 2. The accuracy of the model is calculated as (TP + TN) / (TP + TN + FP + FN).
There are other measures we can calculate from the confusion matrix above.
Precision = TP / (TP + FP) * 100 = 92.31%. Negative Predictive Value = TN / (TN + FN) * 100 = 76.92%. Sensitivity = TP / (TP + FN) * 100 = 88.89%. Specificity = TN / (TN + FP) * 100 = 83.33%.
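Although no R code is required for this question, the arithmetic above can be double-checked with a few lines of R. This is only a minimal sketch; the variable names (tp, fn, fp, tn, metrics) are chosen here for illustration and are not part of the original answer.

```r
# Counts read from the confusion matrix above.
tp <- 120  # actual positive, tested positive
fn <- 15   # actual positive, tested negative
fp <- 10   # actual negative, tested positive
tn <- 50   # actual negative, tested negative
n <- tp + fn + fp + tn

# Each measure expressed as a percentage.
metrics <- c(
  accuracy    = (tp + tn) / n,
  error_rate  = (fp + fn) / n,
  precision   = tp / (tp + fp),
  npv         = tn / (tn + fn),
  sensitivity = tp / (tp + fn),
  specificity = tn / (tn + fp)
) * 100

round(metrics, 2)
```

Accuracy and error rate always sum to 100%, which is a quick sanity check on the two figures.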
Question b
Find and get a data set from the data sets available within R. Perform exploratory data analysis (EDA) and prepare a codebook on that data set. Explain every answer given.
Answer for Question b
Let's get a data set available within R to perform EDA.
Import the packages that are needed throughout this question.
library(dplyr)
library(ggplot2)
Details of the data set
Throughout this question, I will be using one of the built-in R data sets, which is called nottem.
According to its R documentation (?nottem), this data set contains the average monthly temperatures at Nottingham from 1920 to 1939. All temperatures in the data set are measured in degrees Fahrenheit.
Let's take a quick look at the data set.
nottem
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1920 40.6 40.8 44.4 46.7 54.1 58.5 57.7 56.4 54.3 50.5 42.9 39.8
## 1921 44.2 39.8 45.1 47.0 54.1 58.7 66.3 59.9 57.0 54.2 39.7 42.8
## 1922 37.5 38.7 39.5 42.1 55.7 57.8 56.8 54.3 54.3 47.1 41.8 41.7
## 1923 41.8 40.1 42.9 45.8 49.2 52.7 64.2 59.6 54.4 49.2 36.3 37.6
## 1924 39.3 37.5 38.3 45.5 53.2 57.7 60.8 58.2 56.4 49.8 44.4 43.6
## 1925 40.0 40.5 40.8 45.1 53.8 59.4 63.5 61.0 53.0 50.0 38.1 36.3
## 1926 39.2 43.4 43.4 48.9 50.6 56.8 62.5 62.0 57.5 46.7 41.6 39.8
## 1927 39.4 38.5 45.3 47.1 51.7 55.0 60.4 60.5 54.7 50.3 42.3 35.2
## 1928 40.8 41.1 42.8 47.3 50.9 56.4 62.2 60.5 55.4 50.2 43.0 37.3
## 1929 34.8 31.3 41.0 43.9 53.1 56.9 62.5 60.3 59.8 49.2 42.9 41.9
## 1930 41.6 37.1 41.2 46.9 51.2 60.4 60.1 61.6 57.0 50.9 43.0 38.8
## 1931 37.1 38.4 38.4 46.5 53.5 58.4 60.6 58.2 53.8 46.6 45.5 40.6
## 1932 42.4 38.4 40.3 44.6 50.9 57.0 62.1 63.5 56.3 47.3 43.6 41.8
## 1933 36.2 39.3 44.5 48.7 54.2 60.8 65.5 64.9 60.1 50.2 42.1 35.8
## 1934 39.4 38.2 40.4 46.9 53.4 59.6 66.5 60.4 59.2 51.2 42.8 45.8
## 1935 40.0 42.6 43.5 47.1 50.0 60.5 64.6 64.0 56.8 48.6 44.2 36.4
## 1936 37.3 35.0 44.0 43.9 52.7 58.6 60.0 61.1 58.1 49.6 41.6 41.3
## 1937 40.8 41.0 38.4 47.4 54.1 58.6 61.4 61.8 56.3 50.9 41.4 37.1
## 1938 42.1 41.2 47.3 46.6 52.4 59.0 59.6 60.4 57.0 50.7 47.8 39.2
## 1939 39.4 40.9 42.4 47.8 52.4 58.0 60.7 61.8 58.2 46.7 46.6 37.8
From the printed result, it seems that the data set is not a data frame.
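The month-by-year layout of the printout is typical of a time series (ts) object. The sketch below is one way to inspect what kind of object nottem is before converting it; it only uses functions from base R, and the comments describe what each call reports.

```r
# nottem ships with base R (package datasets), so no import is needed.
class(nottem)      # the kind of object
start(nottem)      # first observation as c(year, month)
end(nottem)        # last observation as c(year, month)
frequency(nottem)  # number of observations per year
length(nottem)     # total number of monthly values
```

With 20 years of monthly data, the length should be 20 * 12 = 240 values.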
Data processing
Let's check whether the data set is a data frame.
temp_data <- nottem;
is.data.frame(nottem)
## [1] FALSE
The data set is not a data frame. Let's coerce it to a data frame.
temp_data <- as.data.frame(temp_data);
head(temp_data, 5)
## x
## 1 40.6
## 2 40.8
## 3 44.4
## 4 46.7
## 5 54.1
It looks like temp_data has only one column. Comparing temp_data with the original data set shows that the values are stored month by month within each year, with the years following one another (January to December 1920, then January 1921, and so on). The data set needs to be processed before EDA.
Let's change the column name from x to temp.
temp_data <- rename(temp_data, temp=x);
head(temp_data, 5)
## temp
## 1 40.6
## 2 40.8
## 3 44.4
## 4 46.7
## 5 54.1
Let's add the month column.
# Calculate the number of rows in the data set.
n_row <- nrow(temp_data);
# Create a vector from 1 to the last row number.
m <- c(1:n_row);
# There are 12 months in a year. Therefore using remainder operation is appropriate to determine the month for each row correctly.
m <- m %% 12;
# Add m as month to the temp_data
temp_data <- mutate(temp_data, month=m);
# 12 %% 12 = 0, so a remainder of 0 corresponds to December.
temp_data$month[which(temp_data$month == 0)] <- 12;
head(temp_data, 12)
## temp month
## 1 40.6 1
## 2 40.8 2
## 3 44.4 3
## 4 46.7 4
## 5 54.1 5
## 6 58.5 6
## 7 57.7 7
## 8 56.4 8
## 9 54.3 9
## 10 50.5 10
## 11 42.9 11
## 12 39.8 12
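As an alternative to the remainder calculation, the month index can be read from the time series itself. The following is a sketch only, assuming the series still starts in January 1920 and ends in December 1939; the name long_data is hypothetical and is not used elsewhere in this report.

```r
# Long-format sketch: one row per month, built straight from the ts object.
long_data <- data.frame(
  year  = rep(1920:1939, each = 12),   # each year repeated for its 12 months
  month = as.integer(cycle(nottem)),   # position within the year, 1 to 12
  temp  = as.numeric(nottem)           # temperature in degrees Fahrenheit
)
head(long_data, 12)
```

Because cycle() already returns 12 for December, no special handling of a zero remainder is needed.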
Let's add the year column and reconstruct the data frame.
# Create a vector for holding the 20 years.
year <- c(1920:1939);
# Filter based on month.
jan <- filter(temp_data, month==1) %>% select(temp);
feb <- filter(temp_data, month==2) %>% select(temp);
mar <- filter(temp_data, month==3) %>% select(temp);
apr <- filter(temp_data, month==4) %>% select(temp);
may <- filter(temp_data, month==5) %>% select(temp);
jun <- filter(temp_data, month==6) %>% select(temp);
jul <- filter(temp_data, month==7) %>% select(temp);
aug <- filter(temp_data, month==8) %>% select(temp);
sep <- filter(temp_data, month==9) %>% select(temp);
oct <- filter(temp_data, month==10) %>% select(temp);
nov <- filter(temp_data, month==11) %>% select(temp);
dec <- filter(temp_data, month==12) %>% select(temp);
# Create a new data frame and assign to temp_data.
temp_data <- data.frame(year=year, c(jan, feb, mar, apr, may, jun, jul, aug, sep, oct, nov, dec))
head(temp_data, 5)
## year temp temp.1 temp.2 temp.3 temp.4 temp.5 temp.6 temp.7 temp.8 temp.9
## 1 1920 40.6 40.8 44.4 46.7 54.1 58.5 57.7 56.4 54.3 50.5
## 2 1921 44.2 39.8 45.1 47.0 54.1 58.7 66.3 59.9 57.0 54.2
## 3 1922 37.5 38.7 39.5 42.1 55.7 57.8 56.8 54.3 54.3 47.1
## 4 1923 41.8 40.1 42.9 45.8 49.2 52.7 64.2 59.6 54.4 49.2
## 5 1924 39.3 37.5 38.3 45.5 53.2 57.7 60.8 58.2 56.4 49.8
## temp.10 temp.11
## 1 42.9 39.8
## 2 39.7 42.8
## 3 41.8 41.7
## 4 36.3 37.6
## 5 44.4 43.6
Change the column names to something useful.
temp_data <- rename(temp_data, jan=temp, feb=temp.1, mar=temp.2, apr=temp.3, may=temp.4, jun=temp.5, jul=temp.6, aug=temp.7, sep=temp.8, oct=temp.9, nov=temp.10, dec=temp.11);
head(temp_data, 5)
## year jan feb mar apr may jun jul aug sep oct nov dec
## 1 1920 40.6 40.8 44.4 46.7 54.1 58.5 57.7 56.4 54.3 50.5 42.9 39.8
## 2 1921 44.2 39.8 45.1 47.0 54.1 58.7 66.3 59.9 57.0 54.2 39.7 42.8
## 3 1922 37.5 38.7 39.5 42.1 55.7 57.8 56.8 54.3 54.3 47.1 41.8 41.7
## 4 1923 41.8 40.1 42.9 45.8 49.2 52.7 64.2 59.6 54.4 49.2 36.3 37.6
## 5 1924 39.3 37.5 38.3 45.5 53.2 57.7 60.8 58.2 56.4 49.8 44.4 43.6
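The same year-by-month layout can also be reached in fewer steps by reshaping the values into a matrix. This is a sketch of that shortcut, not the method used in this report; temp_matrix and wide_data are hypothetical names, and month.abb is the built-in vector of month abbreviations.

```r
# Fill a 20 x 12 matrix row by row, so that each row holds one year.
temp_matrix <- matrix(as.numeric(nottem), ncol = 12, byrow = TRUE,
                      dimnames = list(NULL, tolower(month.abb)))

# Attach the year as the first column.
wide_data <- data.frame(year = 1920:1939, temp_matrix)
head(wide_data, 5)
```

The byrow = TRUE argument matters here: without it the matrix would be filled column by column and the months would be scrambled.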
Check the number of rows in temp_data.
nrow(temp_data)
## [1] 20
There are 20 rows, which matches the 20 years (1920-1939).
Check whether there are any missing values in the data set.
summary(temp_data)
## year jan feb mar apr
## Min. :1920 Min. :34.80 Min. :31.30 Min. :38.30 Min. :42.10
## 1st Qu.:1925 1st Qu.:38.77 1st Qu.:38.35 1st Qu.:40.38 1st Qu.:45.40
## Median :1930 Median :39.70 Median :39.55 Median :42.60 Median :46.80
## Mean :1930 Mean :39.70 Mean :39.19 Mean :42.20 Mean :46.29
## 3rd Qu.:1934 3rd Qu.:41.00 3rd Qu.:40.92 3rd Qu.:44.10 3rd Qu.:47.15
## Max. :1939 Max. :44.20 Max. :43.40 Max. :47.30 Max. :48.90
## may jun jul aug
## Min. :49.20 Min. :52.70 Min. :56.80 Min. :54.30
## 1st Qu.:51.12 1st Qu.:56.98 1st Qu.:60.33 1st Qu.:59.83
## Median :52.90 Median :58.45 Median :61.75 Median :60.50
## Mean :52.56 Mean :58.04 Mean :61.90 Mean :60.52
## 3rd Qu.:53.88 3rd Qu.:59.10 3rd Qu.:63.67 3rd Qu.:61.80
## Max. :55.70 Max. :60.80 Max. :66.50 Max. :64.90
## sep oct nov dec
## Min. :53.00 Min. :46.60 Min. :36.30 Min. :35.20
## 1st Qu.:54.62 1st Qu.:48.27 1st Qu.:41.60 1st Qu.:37.25
## Median :56.60 Median :49.90 Median :42.85 Median :39.50
## Mean :56.48 Mean :49.49 Mean :42.58 Mean :39.53
## 3rd Qu.:57.65 3rd Qu.:50.55 3rd Qu.:43.75 3rd Qu.:41.73
## Max. :60.10 Max. :54.20 Max. :47.80 Max. :45.80
There are no missing values: summary() would report a count of NA's for any column that contained them.
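Reading the summary() output for absent NA counts is indirect, so an explicit check can be a useful complement. The sketch below runs on the original nottem series; n_missing is a name introduced here for illustration.

```r
# Check the original series directly for missing values.
has_missing <- anyNA(nottem)       # TRUE if at least one value is missing
n_missing <- sum(is.na(nottem))    # number of missing values
has_missing
n_missing
```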
Exploratory Data Analysis (EDA)
Analysis 1
Global warming is a threat to humans. It is often said that temperatures rise slightly every year. Let's see whether this holds for Nottingham during 1920 - 1939.
Compute the mean based on the year and add a new column named temp_mean to the existing data frame.
year_mean <- rowMeans(select(temp_data, jan:dec), na.rm=TRUE);
year_mean <- round(year_mean, 1);
temp_data <- mutate(temp_data, temp_mean = year_mean);
head(temp_data, 5)
## year jan feb mar apr may jun jul aug sep oct nov dec temp_mean
## 1 1920 40.6 40.8 44.4 46.7 54.1 58.5 57.7 56.4 54.3 50.5 42.9 39.8 48.9
## 2 1921 44.2 39.8 45.1 47.0 54.1 58.7 66.3 59.9 57.0 54.2 39.7 42.8 50.7
## 3 1922 37.5 38.7 39.5 42.1 55.7 57.8 56.8 54.3 54.3 47.1 41.8 41.7 47.3
## 4 1923 41.8 40.1 42.9 45.8 49.2 52.7 64.2 59.6 54.4 49.2 36.3 37.6 47.8
## 5 1924 39.3 37.5 38.3 45.5 53.2 57.7 60.8 58.2 56.4 49.8 44.4 43.6 48.7
Plot a line graph for the mean temperature based on year.
year <- temp_data$year;
year_mean <- temp_data$temp_mean;
plot(year, year_mean, type = 'o', pch=20, col='blue', xlab='Year', ylab='Temperature (Degree Fahrenheit)', main='Mean Temperature based on Year')
The graph fluctuates and shows no clear upward trend from 1920 to 1939. Therefore, a global warming effect is not evident for Nottingham during the years 1920 to 1939.
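A visual impression of "no clear trend" can be supplemented with a simple least-squares line. The sketch below is an assumption-laden illustration rather than part of the original analysis: it treats the annual means as independent observations, and 20 points are too few for a strong conclusion either way. The names annual_mean and trend_fit are hypothetical.

```r
# Annual means from the monthly series (aggregate() on a ts defaults to yearly).
annual_mean <- as.numeric(aggregate(nottem, FUN = mean))
year <- 1920:1939

# Least-squares line: the slope estimates the change in degrees Fahrenheit per year.
trend_fit <- lm(annual_mean ~ year)
coef(trend_fit)

# Standard errors and p-values for the intercept and slope.
summary(trend_fit)$coefficients
```

A slope that is small relative to its standard error would be consistent with the reading of the graph above.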
Analysis 2
Is Summer the hottest and Winter the coldest in Nottingham?
To answer this question, let's compute the mean for Spring, Summer, Autumn and Winter.
Note: i) Spring (March, April, May) ii) Summer (June, July, August) iii) Autumn (September, October, November) iv) Winter (December, January, February)
# Select the months based on their seasons.
spring_month <- select(temp_data, mar:may);
summer_month <- select(temp_data, jun:aug);
autumn_month <- select(temp_data, sep:nov);
winter_month <- select(temp_data, dec, jan:feb);
year_spring_mean <- round(rowMeans(spring_month, na.rm=TRUE), 1);
year_summer_mean <- round(rowMeans(summer_month, na.rm=TRUE), 1);
year_autumn_mean <- round(rowMeans(autumn_month, na.rm=TRUE), 1);
year_winter_mean <- round(rowMeans(winter_month, na.rm=TRUE), 1);
# Add the season mean to the temp_data data frame.
temp_data <- mutate(temp_data, year_spring_mean=year_spring_mean, year_summer_mean=year_summer_mean, year_autumn_mean=year_autumn_mean, year_winter_mean=year_winter_mean);
head(temp_data, 5)
## year jan feb mar apr may jun jul aug sep oct nov dec temp_mean
## 1 1920 40.6 40.8 44.4 46.7 54.1 58.5 57.7 56.4 54.3 50.5 42.9 39.8 48.9
## 2 1921 44.2 39.8 45.1 47.0 54.1 58.7 66.3 59.9 57.0 54.2 39.7 42.8 50.7
## 3 1922 37.5 38.7 39.5 42.1 55.7 57.8 56.8 54.3 54.3 47.1 41.8 41.7 47.3
## 4 1923 41.8 40.1 42.9 45.8 49.2 52.7 64.2 59.6 54.4 49.2 36.3 37.6 47.8
## 5 1924 39.3 37.5 38.3 45.5 53.2 57.7 60.8 58.2 56.4 49.8 44.4 43.6 48.7
## year_spring_mean year_summer_mean year_autumn_mean year_winter_mean
## 1 48.4 57.5 49.2 40.4
## 2 48.7 61.6 50.3 42.3
## 3 45.8 56.3 47.7 39.3
## 4 46.0 58.8 46.6 39.8
## 5 45.7 58.9 50.2 40.1
Next, average these seasonal means over the years 1920 to 1939 and plot a bar chart.
spring_mean <- mean(year_spring_mean);
summer_mean <- mean(year_summer_mean);
autumn_mean <- mean(year_autumn_mean);
winter_mean <- mean(year_winter_mean);
season_mean <- c(spring_mean, summer_mean, autumn_mean, winter_mean);
barplot(season_mean, xlab='Season', ylab='Temperature (Degree Fahrenheit)', col='purple', names.arg=c('Spring', 'Summer', 'Autumn', 'Winter'), main='Mean Temperature based on Season')
Yes, Summer is the hottest season and Winter is the coldest.
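The ranking of the seasons can also be confirmed numerically instead of being read off the bar chart. This sketch groups every monthly value by season using the same month-to-season mapping as the note above; season_of_month and season_mean_check are names introduced here, and the means differ slightly from the report's because no intermediate rounding is applied.

```r
# Month index (1 to 12) for every observation in the series.
month_idx <- as.integer(cycle(nottem))

# Season for each calendar month, January first.
season_of_month <- c("Winter", "Winter", "Spring", "Spring", "Spring", "Summer",
                     "Summer", "Summer", "Autumn", "Autumn", "Autumn", "Winter")

# Mean temperature per season across all 20 years.
season_mean_check <- tapply(as.numeric(nottem), season_of_month[month_idx], mean)
round(season_mean_check, 2)

names(which.max(season_mean_check))  # hottest season
names(which.min(season_mean_check))  # coldest season
```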
Analysis 3
Let's dig further into Summer and Winter. Which month is the coldest in Winter, and which is the hottest in Summer?
Compute the mean for each month in these seasons respectively from the year 1920 to 1939.
# Summer
jun_mean <- mean(temp_data$jun);
jul_mean <- mean(temp_data$jul);
aug_mean <- mean(temp_data$aug);
summer_mean <- c(jun_mean, jul_mean, aug_mean);
# Winter
dec_mean <- mean(temp_data$dec);
jan_mean <- mean(temp_data$jan);
feb_mean <- mean(temp_data$feb);
winter_mean <- c(dec_mean, jan_mean, feb_mean);
Plot bar chart for Summer.
barplot(summer_mean, xlab='Month', ylab='Temperature (Degree Fahrenheit)', col='orange', names.arg=c('June', 'July', 'August'), main='Mean Temperature based on Summer Month')
Plot bar chart for Winter.
barplot(winter_mean, xlab='Month', ylab='Temperature (Degree Fahrenheit)', col='light blue', names.arg=c('December', 'January', 'February'), main='Mean Temperature based on Winter Month')
The differences between the Winter months are too small to read from the chart. Let's print the numerical values.
print(paste("December: ", dec_mean))
## [1] "December: 39.53"
print(paste("January: ", jan_mean))
## [1] "January: 39.695"
print(paste("February: ", feb_mean))
## [1] "February: 39.19"
July is the hottest month of the Summer season, while February is the coldest month of the Winter season.
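Rather than comparing printed values by eye, the hottest and coldest months can be picked out programmatically. The sketch below computes the mean for all twelve months at once from the original series; month_mean is a hypothetical name, and month.abb is the built-in vector of month abbreviations.

```r
# Mean temperature for each calendar month across 1920 to 1939.
month_idx <- as.integer(cycle(nottem))
month_mean <- tapply(as.numeric(nottem), month_idx, mean)
names(month_mean) <- month.abb
round(month_mean, 2)

names(which.max(month_mean))  # hottest month overall
names(which.min(month_mean))  # coldest month overall
```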
Analysis 4
Let's explore the temperature difference between the hottest month and the coldest month in Spring.
Compute the mean for each month in Spring from the year 1920 to 1939.
# Spring
mar_mean <- mean(temp_data$mar);
apr_mean <- mean(temp_data$apr);
may_mean <- mean(temp_data$may);
spring_mean <- c(mar_mean, apr_mean, may_mean);
Plot a bar chart.
barplot(spring_mean, xlab='Month', ylab='Temperature (Degree Fahrenheit)', col='red', names.arg=c('March', 'April', 'May'), main='Mean Temperature based on Spring Month')
May is the hottest month, while March is the coldest month during the Spring.
Find out the temperature difference in Degree Celsius between these two months.
# Convert to Degree Celsius
may_mean_c <- (may_mean - 32) * (5/9.0);
mar_mean_c <- (mar_mean - 32) * (5/9.0);
print(paste("Temperature difference: ", round(may_mean_c - mar_mean_c, 2)))
## [1] "Temperature difference: 5.76"
The temperature difference is 5.76 Degree Celsius.
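When converting a temperature difference (rather than a temperature) from Fahrenheit to Celsius, the offset of 32 cancels out, so only the factor 5/9 matters. The sketch below illustrates this with the March and May means recomputed from the original series; diff_direct and diff_converted are names introduced here.

```r
# Monthly means for March and May across 1920 to 1939.
temps <- as.numeric(nottem)
month_idx <- as.integer(cycle(nottem))
mar_mean <- mean(temps[month_idx == 3])
may_mean <- mean(temps[month_idx == 5])

# Scale the Fahrenheit difference directly.
diff_direct <- (may_mean - mar_mean) * 5 / 9

# Convert each mean to Celsius first, then subtract.
diff_converted <- (may_mean - 32) * 5 / 9 - (mar_mean - 32) * 5 / 9

round(diff_direct, 2)
all.equal(diff_direct, diff_converted)
```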
Conclusion
There are several findings from the EDA.
A global warming effect is not clearly shown for Nottingham during the years 1920 to 1939. More features are needed to explore the global warming effect during those years.
Summer is the hottest season and Winter is the coldest, based on the nottem data set.
July is the hottest month of the Summer season, while February is the coldest month of the Winter season. Knowing this, people can prepare in advance for the extreme temperatures of Summer (hot) and Winter (cold) in the following years.
The temperature difference between the hottest month (May) and the coldest month (March) in Spring is 5.76 Degree Celsius. This difference may seem big, but Spring lies between the coldest season (Winter) and the hottest season (Summer).
Question c
Demonstrate useful functions of dplyr for data manipulation for the following: i. Change the existing column name to something new. ii. Pick rows based on their values. iii. Add new columns to a data frame. iv. Combine data across two or more data frames.
Explain the use of each function, show the R code and provide a short explanation for each produced output. You can create your own sensible dataset for this question with at least 10 observations.
Answer for Question c
Let me import the dplyr package.
library(dplyr)
Create a dataset with at least 10 observations.
# d represents day, m represents month, t represents temperature in Degree Celsius.
d <- c(1:10);
m <- rep(5,10);
t <- c(29,30,30,29,28,29,29,29,30,30);
weather_df <- data.frame(d = d, m = m, t = t);
weather_df
## d m t
## 1 1 5 29
## 2 2 5 30
## 3 3 5 30
## 4 4 5 29
## 5 5 5 28
## 6 6 5 29
## 7 7 5 29
## 8 8 5 29
## 9 9 5 30
## 10 10 5 30
i. Change the existing column name to something new.
The rename function in the dplyr package changes variable names to something more meaningful. Each argument takes the form new_name = old_name.
weather_df <- rename(weather_df, day=d, month=m, temp=t);
weather_df
## day month temp
## 1 1 5 29
## 2 2 5 30
## 3 3 5 30
## 4 4 5 29
## 5 5 5 28
## 6 6 5 29
## 7 7 5 29
## 8 8 5 29
## 9 9 5 30
## 10 10 5 30
The column names have changed: d is now day, m is now month, and t is now temp.
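When many columns need the same kind of change, renaming them one by one becomes tedious. As a hedged aside, recent versions of dplyr (1.0.0 and later, which is an assumption about the installed version) provide rename_with, which applies a function to the column names. The data frame example_df below is a small made-up example, not the weather_df used in this answer.

```r
library(dplyr)

# A small made-up data frame for illustration only.
example_df <- data.frame(day = 1:3, month = rep(5, 3), temp = c(29, 30, 30))

# Apply toupper() to every column name.
upper_df <- rename_with(example_df, toupper)
names(upper_df)
```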
ii. Pick rows based on their values.
The filter function in the dplyr package keeps only the rows that satisfy a condition on one or more variables.
# Let's find the number of days on which the temperature is at least 30 Degree Celsius.
# Subset the data that temp >= 30
filtered_df <- filter(weather_df, temp >= 30);
# Show the filtered_df data frame.
filtered_df
## day month temp
## 1 2 5 30
## 2 3 5 30
## 3 9 5 30
## 4 10 5 30
# Find the number of rows
nrow(filtered_df)
## [1] 4
From the result, there are 4 days on which the temperature is at least 30 Degree Celsius.
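filter also accepts several conditions at once, which are combined with a logical AND. The sketch below rebuilds the first three columns of the weather data under the hypothetical name weather_example so that it runs on its own; late_hot_days is likewise a name introduced here.

```r
library(dplyr)

# Same day, month and temp values as the data set created above.
weather_example <- data.frame(day = 1:10, month = rep(5, 10),
                              temp = c(29, 30, 30, 29, 28, 29, 29, 29, 30, 30))

# Comma-separated conditions must all hold for a row to be kept.
late_hot_days <- filter(weather_example, temp >= 30, day > 5)
late_hot_days

# Counting matching rows without building an intermediate data frame.
n_hot_days <- sum(weather_example$temp >= 30)
n_hot_days
```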
iii. Add new columns to a data frame.
The mutate function in the dplyr package adds new columns (variables) to a data frame.
# Add two new columns, which are the highest wind speed (km/h) and highest humidity (%) on that day.
h_wind <- c(17, 15, 13, 20, 19, 15, 15, 13, 15, 19);
h_humidity <- c(100, 94, 94, 94, 100, 100, 100, 94, 94, 94);
weather_df <- mutate(weather_df, h_wind = h_wind, h_humidity = h_humidity);
weather_df
## day month temp h_wind h_humidity
## 1 1 5 29 17 100
## 2 2 5 30 15 94
## 3 3 5 30 13 94
## 4 4 5 29 20 94
## 5 5 5 28 19 100
## 6 6 5 29 15 100
## 7 7 5 29 15 100
## 8 8 5 29 13 94
## 9 9 5 30 15 94
## 10 10 5 30 19 94
The vectors h_wind and h_humidity are added as new variables to the data frame.
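mutate can also derive a new column from existing ones instead of attaching a pre-built vector. The following sketch converts Celsius to Fahrenheit on a small made-up data frame; weather_example and temp_f are hypothetical names used only for this illustration.

```r
library(dplyr)

# A small made-up data frame for illustration only.
weather_example <- data.frame(day = 1:3, temp = c(29, 30, 28))

# Compute Fahrenheit from the existing Celsius column.
weather_example <- mutate(weather_example, temp_f = temp * 9 / 5 + 32)
weather_example
```

Deriving the column inside mutate keeps the calculation next to the data it depends on, so the two cannot drift out of sync.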
iv. Combine data across two or more data frames.
Create another weather data frame that contains the next 5 days.
d <- c(11:15);
m <- rep(5,5);
t <- c(30, 29, 30, 29, 29);
h_wind <- c(15, 19, 17, 7, 15);
h_humidity <- c(89, 94, 94, 94, 94);
weather_df_2 <- data.frame(day=d, month=m, temp=t, h_wind=h_wind, h_humidity=h_humidity);
weather_df_2
## day month temp h_wind h_humidity
## 1 11 5 30 15 89
## 2 12 5 29 19 94
## 3 13 5 30 17 94
## 4 14 5 29 7 94
## 5 15 5 29 15 94
Merge the two weather data frames into one using the union function in the dplyr package. The union function combines data vertically, keeping the rows that appear in either or both of x and y and removing duplicate rows.
weather_df <- union(weather_df, weather_df_2);
weather_df
## day month temp h_wind h_humidity
## 1 1 5 29 17 100
## 2 2 5 30 15 94
## 3 3 5 30 13 94
## 4 4 5 29 20 94
## 5 5 5 28 19 100
## 6 6 5 29 15 100
## 7 7 5 29 15 100
## 8 8 5 29 13 94
## 9 9 5 30 15 94
## 10 10 5 30 19 94
## 11 11 5 30 15 89
## 12 12 5 29 19 94
## 13 13 5 30 17 94
## 14 14 5 29 7 94
## 15 15 5 29 15 94
weather_df is combined vertically with weather_df_2. No rows are dropped here because the two data frames have no rows in common.
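Because union treats the data frames as sets, identical rows are collapsed into one. If every row must be kept, bind_rows from dplyr is an alternative worth knowing. The sketch below contrasts the two on small made-up data frames (a and b are hypothetical) that share one identical row.

```r
library(dplyr)

# Two made-up data frames that share the row (day = 2, temp = 30).
a <- data.frame(day = c(1, 2), temp = c(29, 30))
b <- data.frame(day = c(2, 3), temp = c(30, 29))

union_result <- union(a, b)      # set union: the shared row appears once
bind_result <- bind_rows(a, b)   # simple stacking: all rows are kept

nrow(union_result)
nrow(bind_result)
```

Which function fits depends on whether repeated observations are meaningful in the data or are accidental duplicates.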