Defining the Data Question


a) Specifying the Question

A Kenyan entrepreneur has created an online cryptography course and wants to advertise it on her blog. She currently targets audiences originating from various countries. In the past, she ran ads for a related course on the same blog and collected data in the process. She would now like to employ our services as a Data Science Consultant to help her identify which individuals are most likely to click on her ads.

b) Defining the Metric for success.

This project will be considered a success once we have thoroughly cleaned our data, performed both univariate and bivariate analysis, and summarized our dataset.

c) Understanding the context

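The dataset is an advertising dataset collected from the ads the entrepreneur previously ran on her blog. It contains 1,000 observations of 10 variables describing each visitor, including age, area income, daily time spent on the site, daily internet usage, and whether the visitor clicked on the ad.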

d) Recording the experimental design.

The following steps will be followed in conducting this study:

  • Define the question, the metric for success, the context, and the experimental design.

  • Read and explore the given dataset. Assess whether the available data is appropriate for answering the given question.

  • Find and deal with outliers, anomalies, and missing data within the dataset.

  • Perform univariate and bivariate analysis and record our observations.

  • Provide a conclusion and recommendation based on our insights.
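
The report does not show how `df` was loaded. The following is a minimal sketch of that step, assuming the data ships as a CSV file whose header uses spaces. The header row and the use of `read.csv` are assumptions; the three data rows are copied from the `head(df)` output in this report.

```r
# Assumption: the raw file is a CSV whose header contains spaces.
# The rows below are copied from the head(df) output shown in this report.
csv_text <- "Daily Time Spent on Site,Age,Area Income,Daily Internet Usage,Ad Topic Line,City,Male,Country,Timestamp,Clicked on Ad
68.95,35,61833.90,256.09,Cloned 5thgeneration orchestration,Wrightburgh,0,Tunisia,2016-03-27 00:53:11,0
80.23,31,68441.85,193.77,Monitored national standardization,West Jodi,1,Nauru,2016-04-04 01:39:02,0
69.47,26,59785.94,236.50,Organic bottom-line service-desk,Davidton,0,San Marino,2016-03-13 20:35:42,0"

# read.csv() accepts inline text through the `text` argument;
# with a real file you would pass the file path instead.
sample_df <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# By default read.csv() turns spaces in header names into dots.
print(names(sample_df))
str(sample_df)
```

With the real file the call would look like `df <- read.csv(path)`, where `path` is wherever the file is stored. The default conversion of header spaces to dots would explain column names such as `Daily.Time.Spent.on.Site`.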

Viewing top entries

head(df)
##   Daily.Time.Spent.on.Site Age Area.Income Daily.Internet.Usage
## 1                    68.95  35    61833.90               256.09
## 2                    80.23  31    68441.85               193.77
## 3                    69.47  26    59785.94               236.50
## 4                    74.15  29    54806.18               245.89
## 5                    68.37  35    73889.99               225.58
## 6                    59.99  23    59761.56               226.74
##                           Ad.Topic.Line           City Male    Country
## 1    Cloned 5thgeneration orchestration    Wrightburgh    0    Tunisia
## 2    Monitored national standardization      West Jodi    1      Nauru
## 3      Organic bottom-line service-desk       Davidton    0 San Marino
## 4 Triple-buffered reciprocal time-frame West Terrifurt    1      Italy
## 5         Robust logistical utilization   South Manuel    0    Iceland
## 6       Sharable client-driven software      Jamieberg    1     Norway
##             Timestamp Clicked.on.Ad
## 1 2016-03-27 00:53:11             0
## 2 2016-04-04 01:39:02             0
## 3 2016-03-13 20:35:42             0
## 4 2016-01-10 02:31:19             0
## 5 2016-06-03 03:36:18             0
## 6 2016-05-19 14:30:17             0
# checking data composition
str(df)
## 'data.frame':    1000 obs. of  10 variables:
##  $ Daily.Time.Spent.on.Site: num  69 80.2 69.5 74.2 68.4 ...
##  $ Age                     : int  35 31 26 29 35 23 33 48 30 20 ...
##  $ Area.Income             : num  61834 68442 59786 54806 73890 ...
##  $ Daily.Internet.Usage    : num  256 194 236 246 226 ...
##  $ Ad.Topic.Line           : chr  "Cloned 5thgeneration orchestration" "Monitored national standardization" "Organic bottom-line service-desk" "Triple-buffered reciprocal time-frame" ...
##  $ City                    : chr  "Wrightburgh" "West Jodi" "Davidton" "West Terrifurt" ...
##  $ Male                    : int  0 1 0 1 0 1 0 1 1 1 ...
##  $ Country                 : chr  "Tunisia" "Nauru" "San Marino" "Italy" ...
##  $ Timestamp               : chr  "2016-03-27 00:53:11" "2016-04-04 01:39:02" "2016-03-13 20:35:42" "2016-01-10 02:31:19" ...
##  $ Clicked.on.Ad           : int  0 0 0 0 0 0 0 1 0 0 ...
#checking dimension of our dataset
dim(df)
## [1] 1000   10
#confirming our dataset is a dataframe
class(df)
## [1] "data.frame"

Cleaning our data

Checking for missing values

sum(is.na(df))
## [1] 0
# there are no missing values

Checking for duplicates

sum(duplicated(df))
## [1] 0
# there are no duplicates
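
The totals above say whether anything is missing or duplicated, but not where. As a small sketch on a made-up data frame (`toy` is illustrative, not part of the dataset), per-column and per-row checks could look like this:

```r
# Illustrative data: one missing Age and one fully duplicated row.
toy <- data.frame(
  Age  = c(35, NA, 26, 26),
  Male = c(0, 1, 0, 0)
)

# Missing values counted per column rather than as one total.
missing_per_column <- colSums(is.na(toy))
print(missing_per_column)

# duplicated() flags the second and later copies of a row.
duplicate_rows <- which(duplicated(toy))
print(duplicate_rows)
```

On the real data the same calls would be `colSums(is.na(df))` and `which(duplicated(df))`.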

Checking and dealing with outliers

boxplot(df$`Area.Income`,main="Boxplot for Area.Income",col = "grey")

boxplot(df$`Age`,main="Boxplot for Age",col = "orange")

boxplot(df$`Daily.Time.Spent.on.Site`,main="Boxplot for Daily.Time.Spent.on.Site",col = "green")

boxplot(df$`Male`,main="Boxplot for Male",col = "blue")

boxplot(df$`Daily.Internet.Usage`,main="Boxplot for Daily.Internet.Usage",col = "yellow")

boxplot(df$`Clicked.on.Ad`,main="Boxplot for Clicked.on.Ad",col = "red")

# Only a few outliers appear in our columns, so we leave them as they are.
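
Reading outliers off a boxplot is subjective, so it can help to count them. The sketch below uses `boxplot.stats()`, which applies the same 1.5 × IQR whisker rule that `boxplot()` draws; the helper name `count_outliers` and the vector `toy` are made up for illustration.

```r
# Count the points that boxplot() would draw beyond the whiskers.
count_outliers <- function(x) {
  length(boxplot.stats(x)$out)
}

# Illustrative values: 100 lies far outside the rest.
toy <- c(10, 12, 11, 13, 12, 11, 100)
n_out <- count_outliers(toy)
print(n_out)
```

On the real data, `sapply(df[sapply(df, is.numeric)], count_outliers)` would give one count per numeric column.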

Univariate Analysis

summary(df)
##  Daily.Time.Spent.on.Site      Age         Area.Income    Daily.Internet.Usage
##  Min.   :32.60            Min.   :19.00   Min.   :13996   Min.   :104.8       
##  1st Qu.:51.36            1st Qu.:29.00   1st Qu.:47032   1st Qu.:138.8       
##  Median :68.22            Median :35.00   Median :57012   Median :183.1       
##  Mean   :65.00            Mean   :36.01   Mean   :55000   Mean   :180.0       
##  3rd Qu.:78.55            3rd Qu.:42.00   3rd Qu.:65471   3rd Qu.:218.8       
##  Max.   :91.43            Max.   :61.00   Max.   :79485   Max.   :270.0       
##  Ad.Topic.Line          City                Male         Country         
##  Length:1000        Length:1000        Min.   :0.000   Length:1000       
##  Class :character   Class :character   1st Qu.:0.000   Class :character  
##  Mode  :character   Mode  :character   Median :0.000   Mode  :character  
##                                        Mean   :0.481                     
##                                        3rd Qu.:1.000                     
##                                        Max.   :1.000                     
##   Timestamp         Clicked.on.Ad
##  Length:1000        Min.   :0.0  
##  Class :character   1st Qu.:0.0  
##  Mode  :character   Median :0.5  
##                     Mean   :0.5  
##                     3rd Qu.:1.0  
##                     Max.   :1.0
# summary of our dataset, i.e. minimum, quartiles, median, mean, and maximum

Getting important measures of dispersion (range and standard deviation)

cat("the range of age  is",range(df$'Age'))
## the range of age  is 19 61
cat("\n")
cat("the range of  Area.Income is",range(df$'Area.Income'))
## the range of  Area.Income is 13996.5 79484.8
cat("\n")
cat("the range of Daily.Time.Spent.on.Site  is",range(df$'Daily.Time.Spent.on.Site'))
## the range of Daily.Time.Spent.on.Site  is 32.6 91.43
cat("\n")
cat("the range of male  is",range(df$'Male'))
## the range of male  is 0 1
cat("\n")
cat("the range of  Daily.Internet.Usage is",range(df$'Daily.Internet.Usage'))
## the range of  Daily.Internet.Usage is 104.78 269.96
cat("\n")
cat("the standard deviation of age  is",sd(df$'Age'))
## the standard deviation of age  is 8.785562
cat("\n")
cat("the standard deviation of Area.Income  is",sd(df$'Area.Income'))
## the standard deviation of Area.Income  is 13414.63
cat("\n")
cat("the standard deviation of Daily.Time.Spent.on.Site is",sd(df$'Daily.Time.Spent.on.Site'))
## the standard deviation of Daily.Time.Spent.on.Site is 15.85361
cat("\n")
cat("the standard deviation of male is",sd(df$'Male'))
## the standard deviation of male is 0.4998889
cat("\n")
cat("the standard deviation of Daily.Internet.Usage  is",sd(df$'Daily.Internet.Usage'))
## the standard deviation of Daily.Internet.Usage  is 43.90234
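
The same measures can be collected in one table instead of one `cat()` call per column. This is a sketch only; the helper `dispersion` and the data frame `toy` are made up for illustration.

```r
# Return min, max, and standard deviation for every numeric column.
dispersion <- function(data) {
  numeric_cols <- data[sapply(data, is.numeric)]
  t(sapply(numeric_cols, function(x) {
    c(min = min(x), max = max(x), sd = sd(x))
  }))
}

# Illustrative data; the character column is skipped automatically.
toy <- data.frame(
  Age    = c(20, 30, 40),
  City   = c("a", "b", "c"),
  Income = c(100, 200, 300)
)
result <- dispersion(toy)
print(result)
```

Calling `dispersion(df)` on the real data would produce one row per numeric column.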

Getting a histogram of our columns

hist(df$`Area.Income`,main="histogram for Area.Income",col = "grey")

hist(df$`Age`,main="histogram for Age",col = "orange")

hist(df$`Daily.Time.Spent.on.Site`,main="histogram for Daily.Time.Spent.on.Site",col = "green")

hist(df$`Male`,main="histogram for Male",col = "blue")

hist(df$`Daily.Internet.Usage`,main="histogram for Daily.Internet.Usage",col = "yellow")

hist(df$`Clicked.on.Ad`,main="histogram for Clicked.on.Ad",col = "red")
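
`Male` and `Clicked.on.Ad` take only the values 0 and 1, so a frequency table may be easier to read than a histogram. A small sketch on made-up values (the vector `clicked` is illustrative, not taken from the dataset):

```r
# Illustrative 0/1 values.
clicked <- c(0, 1, 1, 0, 1, 0)

# Absolute counts per value, then the share of each value.
counts <- table(clicked)
shares <- prop.table(counts)
print(counts)
print(shares)
```

On the real data the equivalent would be `prop.table(table(df$Clicked.on.Ad))`, and `barplot(table(df$Clicked.on.Ad))` would draw the counts.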

Univariate Summary

  1. In our dataset, many people are aged between 25 and 40.
  2. In our dataset, the most common daily time spent on site is between 75 and 85.
  3. In our dataset, the most common area income is between 50,000 and 70,000.
  4. In our dataset, the binary columns are roughly evenly distributed: the mean of Male is 0.481 and the mean of Clicked.on.Ad is 0.5.

Bivariate analysis

#assigning columns to respective variables
ts<-df$Daily.Time.Spent.on.Site
age<-df$Age
ai<-df$Area.Income
dis<-df$Daily.Internet.Usage
mal<-df$Male
ca<-df$Clicked.on.Ad

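Getting variance between columns

Note that var(x, y) called with two vectors returns their covariance, so the values below match the cov() results in the covariance section.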

cat("the variance between age and daily time spent on site is",var(ts,age))
## the variance between age and daily time spent on site is -46.17415
cat("\n")
cat("the variance between age and Area.Income is",var(age,ai))
## the variance between age and Area.Income is -21520.93
cat("\n")
cat("the variance between age and daily internet usage is",var(age,dis))
## the variance between age and daily internet usage is -141.6348
cat("\n")
cat("the variance between age and Clicked.on.Ad is",var(ca,age))
## the variance between age and Clicked.on.Ad is 2.164665
cat("\n")
cat("the variance between area income and daily time spent on site is",var(ts,ai))
## the variance between area income and daily time spent on site is 66130.81
cat("\n")
cat("the variance between daily internet usage and daily time spent on site is",var(ts,dis))
## the variance between daily internet usage and daily time spent on site is 360.9919
cat("\n")
cat("the variance between clicked on ad and daily time spent on site is",var(ts,ca))
## the variance between clicked on ad and daily time spent on site is -5.933143
cat("\n")
cat("the variance between daily internet usage and area income is",var(ai,dis))
## the variance between daily internet usage and area income is 198762.5
cat("\n")
cat("the variance between daily internet usage and clicked on ad is",var(ca,dis))
## the variance between daily internet usage and clicked on ad is -17.27409
cat("\n")
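
In R, `var()` with a single vector gives that vector's variance, while `var()` with two vectors gives their covariance. The sketch below shows the difference on made-up vectors (`x` and `y` are illustrative):

```r
# Illustrative vectors.
x <- c(1, 2, 3, 4)
y <- c(2, 4, 6, 9)

variance_x    <- var(x)     # variance of x alone
two_arg_var   <- var(x, y)  # two arguments: covariance of x and y
covariance_xy <- cov(x, y)  # same quantity via cov()

print(c(variance_x = variance_x,
        two_arg_var = two_arg_var,
        covariance_xy = covariance_xy))
```

For the per-column variances of the real data, `sapply(df[sapply(df, is.numeric)], var)` would be one option.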

Getting correlation between columns

cat("the correlation between age and daily time spent on site is",cor(ts,age))
## the correlation between age and daily time spent on site is -0.3315133
cat("\n")
cat("the correlation between age and Area.Income is",cor(age,ai))
## the correlation between age and Area.Income is -0.182605
cat("\n")
cat("the correlation between age and daily internet usage is",cor(age,dis))
## the correlation between age and daily internet usage is -0.3672086
cat("\n")
cat("the correlation between age and Clicked.on.Ad is",cor(ca,age))
## the correlation between age and Clicked.on.Ad is 0.4925313
cat("\n")
cat("the correlation between area income and daily time spent on site is",cor(ts,ai))
## the correlation between area income and daily time spent on site is 0.3109544
cat("\n")
cat("the correlation between daily internet usage and daily time spent on site is",cor(ts,dis))
## the correlation between daily internet usage and daily time spent on site is 0.5186585
cat("\n")
cat("the correlation between clicked on ad and daily time spent on site is",cor(ts,ca))
## the correlation between clicked on ad and daily time spent on site is -0.7481166
cat("\n")
cat("the correlation between daily internet usage and area income is",cor(ai,dis))
## the correlation between daily internet usage and area income is 0.3374955
cat("\n")
cat("the correlation between daily internet usage and clicked on ad is",cor(ca,dis))
## the correlation between daily internet usage and clicked on ad is -0.7865392
cat("\n")
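
Pairwise calls like the ones above are easy to mislabel or repeat. As a sketch, `cor()` applied to a data frame of numeric columns returns every pair at once; the data frame `toy` below is made up for illustration.

```r
# Illustrative data: usage falls linearly as age rises.
toy <- data.frame(
  age   = c(20, 30, 40, 50, 60),
  usage = c(250, 220, 190, 160, 130),
  city  = c("A", "B", "C", "D", "E")
)

# cor() needs numeric input, so drop the character column first.
numeric_cols <- toy[sapply(toy, is.numeric)]
cor_matrix <- cor(numeric_cols)
print(round(cor_matrix, 2))
```

On the real data, `round(cor(df[sapply(df, is.numeric)]), 2)` would also cover the pairs involving `Male`, which are not computed above.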

Getting covariance between columns

cat("the covariance between age and daily time spent on site is",cov(ts,age))
## the covariance between age and daily time spent on site is -46.17415
cat("\n")
cat("the covariance between age and Area.Income is",cov(age,ai))
## the covariance between age and Area.Income is -21520.93
cat("\n")
cat("the covariance between age and daily internet usage is",cov(age,dis))
## the covariance between age and daily internet usage is -141.6348
cat("\n")
cat("the covariance between age and Clicked.on.Ad is",cov(ca,age))
## the covariance between age and Clicked.on.Ad is 2.164665
cat("\n")
cat("the covariance between area income and daily time spent on site is",cov(ts,ai))
## the covariance between area income and daily time spent on site is 66130.81
cat("\n")
cat("the covariance between daily internet usage and daily time spent on site is",cov(ts,dis))
## the covariance between daily internet usage and daily time spent on site is 360.9919
cat("\n")
cat("the covariance between clicked on ad and daily time spent on site is",cov(ts,ca))
## the covariance between clicked on ad and daily time spent on site is -5.933143
cat("\n")
cat("the covariance between daily internet usage and area income is",cov(ai,dis))
## the covariance between daily internet usage and area income is 198762.5
cat("\n")
cat("the covariance between daily internet usage and clicked on ad is",cov(ca,dis))
## the covariance between daily internet usage and clicked on ad is -17.27409
cat("\n")
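
Covariance depends on the units of both columns, which is why the values above range from single digits to hundreds of thousands. Dividing by both standard deviations gives the unit-free correlation. The sketch below checks this using only numbers already reported in this document (the variable names are illustrative):

```r
# Values copied from the outputs reported in this document.
cov_ts_age <- -46.17415  # covariance of time spent on site and age
sd_ts      <- 15.85361   # sd of Daily.Time.Spent.on.Site
sd_age     <- 8.785562   # sd of Age

# Pearson correlation = covariance / (sd_x * sd_y)
r_ts_age <- cov_ts_age / (sd_ts * sd_age)
print(round(r_ts_age, 4))
```

The result agrees with the correlation of -0.3315133 reported for the same pair.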

Plotting scatterplots between columns

plot(age, dis, xlab="age", ylab="daily internet usage",col = "orange")

plot(age,ai, xlab="age", ylab="area income",col="blue")

plot(age, ts, xlab="age", ylab="Time spent on site",col="red")

plot(age,ca, xlab="age", ylab="clicked on ad",col="yellow")

plot(ts,ai, xlab="Time spent on site", ylab="area income",col="pink")

plot(ts,dis, xlab="Time spent on site", ylab="daily internet usage",col="grey")

plot(ts,ca, xlab="Time spent on site", ylab="clicked on ad",col="green")

plot(ai,dis, xlab="area income", ylab="daily internet usage",col="purple")

plot(ca,dis, xlab="clicked on ad", ylab="daily internet usage",col="black")
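
Because `Clicked.on.Ad` is binary, its scatterplots collapse into two rows or columns of points. Comparing group means is one complementary view. The sketch below uses made-up numbers (`usage` and `clicked` are illustrative vectors, not values from the dataset):

```r
# Illustrative data: three visitors who did not click, three who did.
usage   <- c(250, 230, 240, 120, 140, 130)
clicked <- c(0, 0, 0, 1, 1, 1)

# Mean usage within each click group.
group_means <- tapply(usage, clicked, mean)
print(group_means)
```

On the real data the equivalent call would be `tapply(dis, ca, mean)`, and `boxplot(dis ~ ca)` would show the full distribution for each group.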

Bivariate summary

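  1. Clicked on ad is strongly negatively correlated with daily internet usage (about -0.79).
  2. Clicked on ad is strongly negatively correlated with time spent on site (about -0.75).
  3. Clicked on ad is moderately positively correlated with age (about 0.49).
  4. The remaining pairs we computed, none of which involve clicked on ad or male, show weak to moderate correlations, from about 0.18 in magnitude (age and area income) to about 0.52 (daily internet usage and time spent on site).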

Conclusion

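Our analysis shows that the main columns are correlated. In particular, clicking on the ad is negatively correlated with daily time spent on site (about -0.75) and daily internet usage (about -0.79), and positively correlated with age (about 0.49).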


Recommendation

  1. More emphasis should be put on all ages, not just those between 25 and 40 years. This will ensure that we get an accurate representation.