- Set up R and RStudio
- Understand the layout and functionality of RStudio
- Experiment with basic R expressions and data types
- Data manipulation
- Cluster Analysis
R is a programming environment
Step by step installation guides from YouTube:
RStudio is a development environment for R, and provides many advanced features to improve efficiency and ease of use for R users.


This is the most important panel, because it is where R actually executes your commands.

Collections of commands (scripts) can be edited and saved.


There are ways to speed up the workflow:
If you don’t select any code, R will just execute the line where the blinking cursor is
Instead of clicking the “Run” icon, you can just use the keyboard shortcut: Ctrl + Enter


7 + 5
## [1] 12
7 - 5
## [1] 2
7 * 5
## [1] 35
7 / 5
## [1] 1.4

Like a calculator, R also has many functions that let you do more sophisticated manipulations.
round(2.05)
## [1] 2
factorial(3) # 3! = 3 * 2 * 1
## [1] 6
sqrt(9) # square root
## [1] 3
Note:
Use # (number sign) to comment your code.
There will be many occasions where you want to learn more about a built-in command or function. Type help(function_name) or ?function_name to get more information. For example:
help(factorial)
?factorial
Use two question marks to search the whole help database, which is especially useful when you don’t know the exact function name. For example:
??read
R can recognize different types of data:
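As a minimal sketch of this idea, class() reports the type R has assigned to a value. The example values below are illustrative only and are not taken from the course data.

```r
# Ask R which type it has assigned to a few example values
class(7.5)      # "numeric"
class(2L)       # "integer" (the L suffix marks a whole number)
class("hello")  # "character"
class(TRUE)     # "logical"

# Collect the results so they can be inspected together
types <- c(class(7.5), class(2L), class("hello"), class(TRUE))
types
```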
We can give names to data objects; these give us variables
Variables are created with the assignment operator, <- or =
Be careful: R is a case-sensitive language. FOO, Foo, and foo are three different variables!
x = 2   # use the equal sign to assign a value
y <- 3  # you can also use an arrow to assign a value
x       # print the value of a variable by typing its name
## [1] 2
x * y
## [1] 6
The assignment operator also changes values:
x
## [1] 2
x <- 8
x
## [1] 8
Using names and variables makes code: easier to design, easier to debug, less prone to bugs, easier to improve, and easier for others to read
Variable names cannot begin with a number. It is wise to avoid special characters, except the period (.) and underscore (_).
Example of valid names:
Example of invalid names:
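One way to check a name, sketched here under the assumption that "valid" means syntactically valid in R, is to compare it with the output of make.names(). The candidate names below are hypothetical and chosen only for illustration.

```r
# Hypothetical candidate names, chosen only for illustration
candidates <- c("total_sales", "total.sales", "x2", "2x", "total-sales", "my var")

# make.names() rewrites a string into a syntactically valid name,
# so a string that comes back unchanged was already valid
is_valid <- make.names(candidates) == candidates

# Show each candidate next to the result of the check
data.frame(name = candidates, valid = is_valid)
```

Names that start with a digit, or that contain a hyphen or a space, come back changed and are therefore flagged as invalid.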
A data frame is a set of vectors of equal length. Think of a data frame as an Excel sheet or a database table.
Column names are preserved or guessed if not explicitly set
course <- c("MIS4730", "MIS4710", "MIS4950", "MIS1234")
num_of_students <- c(20, 10, 40, 30)
data_analytics_minor <- c(TRUE, TRUE, TRUE, FALSE)
df <- data.frame(course, n_students = num_of_students, data_analytics_minor,
                 stringsAsFactors = FALSE)
df # notice the column names and row names
##    course n_students data_analytics_minor
## 1 MIS4730         20                 TRUE
## 2 MIS4710         10                 TRUE
## 3 MIS4950         40                 TRUE
## 4 MIS1234         30                FALSE
There are many ways you can get values out of a column:
dataframe_name$column_name
df$course
## [1] "MIS4730" "MIS4710" "MIS4950" "MIS1234"
df$n_students
## [1] 20 10 40 30
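Besides the dollar sign, base R offers a few other ways to pull a column out of a data frame. The sketch below rebuilds a two-column version of the course data frame from the example above so that it runs on its own; the names by_dollar, by_name, by_position, and by_matrix are illustrative.

```r
# Rebuild a small version of the course data frame from the example above
df <- data.frame(
  course = c("MIS4730", "MIS4710", "MIS4950", "MIS1234"),
  n_students = c(20, 10, 40, 30),
  stringsAsFactors = FALSE
)

by_dollar   <- df$n_students       # dollar-sign notation
by_name     <- df[["n_students"]]  # double brackets with the column name
by_position <- df[[2]]             # double brackets with the column position
by_matrix   <- df[, "n_students"]  # matrix-style indexing: all rows, one column

# Check that two of the approaches return the same numeric vector
identical(by_dollar, by_name)
```

The dollar sign is the most convenient when typing interactively; the bracket forms are handy when the column name is stored in another variable.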
Importing data into R is fairly simple. You can go to Canvas - Dataset Module to download the following data.
Your working directory is the folder on your computer in which you are currently working.
# Show your current working directory
getwd()
# List the files and folders in the current working directory
list.files()
You can set your working directory in the following ways:
Create a new folder named QBA-Dataset for this course, then download the HousePrices data and move it into the folder.
Set this new folder as your default working directory via Tools > Global Options.
Run list.files() in the R console. Do you see these files?
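The working directory can also be changed from code with setwd(). The sketch below is only an illustration: it creates a QBA-Dataset folder inside R's temporary directory (a hypothetical location, so that the example runs anywhere) and adds a placeholder file. On your own computer you would pass the real path of your course folder to setwd().

```r
# Remember where we started so the change can be undone
old_wd <- getwd()

# Hypothetical location: a QBA-Dataset folder inside R's temporary directory.
# On your own computer, use the real path to your course folder instead.
course_dir <- file.path(tempdir(), "QBA-Dataset")
dir.create(course_dir, showWarnings = FALSE)

# Add a small placeholder file so list.files() has something to show
writeLines("placeholder", file.path(course_dir, "example.txt"))

setwd(course_dir)      # make the folder the working directory
files <- list.files()  # files visible from the new working directory
files

setwd(old_wd)          # restore the original working directory
```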
You can view your data by double-clicking it.
head() is a function that lets you see the first few rows of a data frame.
df <- read.csv("HousePrices.csv")
head(df, n=3) # n indicates how many rows you'd like to see
##   X  price lot_size waterfront age land_value construction air_cond     fuel
## 1 1 132500     0.09         No  42      50000           No       No Electric
## 2 2 181115     0.92         No   0      22300           No       No      Gas
## 3 3 109000     0.19         No 133       7300           No       No      Gas
##        heat   sewer living_area fireplaces bathrooms rooms
## 1  Electric Private         906          1       1.0     5
## 2 Hot Water Private        1953          0       2.5     6
## 3 Hot Water  Public        1944          1       1.0     8
Reminder: the way to use a variable in the dataset: dataframe_name$column_name
For example, if we wanted to estimate the number of bedrooms using number of rooms minus number of bathrooms:
df$bedrooms <- df$rooms - df$bathrooms
Look at the variable list: how many variables do we have now?
Create a new variable, price_per_bedroom, which is price divided by the number of bedrooms we just created.
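One possible approach is sketched below. So that it runs without the CSV file, it uses a small data frame (given the hypothetical name houses) holding the price, rooms, and bathrooms values of the three rows shown in the head() output; with the full data you would apply the same two assignments to df.

```r
# Three rows copied from the head(df, n = 3) output, keeping only the needed columns
houses <- data.frame(
  price     = c(132500, 181115, 109000),
  rooms     = c(5, 6, 8),
  bathrooms = c(1.0, 2.5, 1.0)
)

# Estimated bedrooms, as defined in the text
houses$bedrooms <- houses$rooms - houses$bathrooms

# New variable: price divided by the estimated number of bedrooms
houses$price_per_bedroom <- houses$price / houses$bedrooms
houses
```

Because bathrooms can take half values, the estimated bedroom count is not always a whole number, so treat the result as a rough indicator.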
You are a senior marketing analyst at a global consulting firm. Your manager has given you a project: conduct a more in-depth analysis of teenage market segments so that your client can extend its product market and increase its retail sales.
teens <- read.csv("snsdata.csv")
The data was sampled across four high school graduation years. The full text of each teen’s Twitter profile was scraped, and each teen’s gender, age, and number of friends were also recorded. The final dataset indicates how many times each word appeared in the person’s profile.
First, we copy the variables we will use into a new data frame.
Name the new data frame interests.
In this case study, we will cluster only on the words that teenagers mentioned, so we will use the variables in columns 5 to 40.
interests <- teens[5:40]
Standardization ensures that all variables contribute equally to the clustering results.
The function we use for standardization is called scale().
lapply() applies the standardization to every variable, and as.data.frame() makes sure the standardized data is returned as a data frame.
interests_z <- as.data.frame(lapply(interests, scale))
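To see what this step does, here is a minimal sketch on a tiny, hypothetical data frame (the name toy and its values are made up, not course data). It applies the same lapply(..., scale) pattern and compares one column with the z-score formula computed by hand.

```r
# Hypothetical word counts for three profiles (not the course data)
toy <- data.frame(music = c(1, 2, 3), sports = c(0, 10, 20))

# Same pattern as in the text: scale every column, then rebuild a data frame
toy_z <- as.data.frame(lapply(toy, scale))

# scale() computes (x - mean(x)) / sd(x) for each column
manual_music <- (toy$music - mean(toy$music)) / sd(toy$music)

toy_z
all.equal(as.numeric(toy_z$music), manual_music)
```

Although the two columns start on very different scales, after standardization both have mean 0 and standard deviation 1.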
To cluster the teenagers into marketing segments, we will use the kmeans() function.
The first two commands can be skipped. We are just using a random seed to make sure we get the same results.
The clusters are created by an algorithm that starts from random initial values, so the cluster results could be slightly different every time we run the code.
# create the clusters using k-means
RNGversion("3.5.2") # use an older random number generator to get the same result
## Warning in RNGkind("Mersenne-Twister", "Inversion", "Rounding"): non-uniform
## 'Rounding' sampler used
set.seed(2345)
teen_clusters <- kmeans(interests_z, 5)
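The effect of the seed can be illustrated with a small sketch. The data here is synthetic (two made-up groups of points stored in a hypothetical data frame toy), not the teen data; the point is only that resetting the same seed before each kmeans() call reproduces the same assignments.

```r
# Hypothetical data: two well-separated groups of 20 points each
set.seed(1)
toy <- data.frame(
  x = c(rnorm(20, mean = 0), rnorm(20, mean = 10)),
  y = c(rnorm(20, mean = 0), rnorm(20, mean = 10))
)

# Run k-means twice, resetting the seed to the same value before each run
set.seed(2345)
fit_a <- kmeans(toy, centers = 2)
set.seed(2345)
fit_b <- kmeans(toy, centers = 2)

# Same seed, same starting centers, same cluster assignments
identical(fit_a$cluster, fit_b$cluster)
fit_a$size
```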
Cluster sizes
# look at the size of the clusters
teen_clusters$size
## [1] 871 600 5981 1034 21514
Take the cluster mean and evaluate the results
teen_clusters$centers
##    basketball   football      soccer    softball  volleyball    swimming
## 1  0.16001227  0.2364174  0.10385512  0.07232021  0.18897158  0.23970234
## 2 -0.09195886  0.0652625 -0.09932124 -0.01739428 -0.06219308  0.03339844
## 3  0.52755083  0.4873480  0.29778605  0.37178877  0.37986175  0.29628671
## 4  0.34081039  0.3593965  0.12722250  0.16384661  0.11032200  0.26943332
## 5 -0.16695523 -0.1641499 -0.09033520 -0.11367669 -0.11682181 -0.10595448
##   cheerleading    baseball      tennis      sports        cute          sex
## 1    0.3931445  0.02993479  0.13532387  0.10257837  0.37884271  0.020042068
## 2   -0.1101103 -0.11487510  0.04062204 -0.09899231 -0.03265037 -0.042486141
## 3    0.3303485  0.35231971  0.14057808  0.32967130  0.54442929  0.002913623
## 4    0.1856664  0.27527088  0.10980958  0.79711920  0.47866008  2.028471066
## 5   -0.1136077 -0.10918483 -0.05097057 -0.13135334 -0.18878627 -0.097928345
##          sexy         hot      kissed       dance        band    marching
## 1  0.11740551  0.41389104  0.06787768  0.22780899 -0.10257102 -0.10942590
## 2 -0.04329091 -0.03812345 -0.04554933  0.04573186  4.06726666  5.25757242
## 3  0.24040196  0.38551819 -0.03356121  0.45662534 -0.02120728 -0.10880541
## 4  0.51266080  0.31708549  2.97973077  0.45535061  0.38053621 -0.02014608
## 5 -0.09501817 -0.13810894 -0.13535855 -0.15932739 -0.12167214 -0.11098063
##        music        rock         god      church       jesus       bible
## 1  0.1378306  0.05905951  0.03651755 -0.00709374  0.01458533 -0.03692278
## 2  0.4981238  0.15963917  0.09283620  0.06414651  0.04801941  0.05863810
## 3  0.2844999  0.21436936  0.35014919  0.53739806  0.27843424  0.22990963
## 4  1.1367885  1.21013948  0.41679142  0.16627797  0.12988313  0.08478769
## 5 -0.1532006 -0.12460034 -0.12144246 -0.15889274 -0.08557822 -0.06813159
##          hair       dress      blonde        mall    shopping       clothes
## 1  0.43807926  0.14905267  0.06137340  0.60368108  0.79806891  0.5651537331
## 2 -0.04484083  0.07201611 -0.01146396 -0.08724304 -0.03865318 -0.0003526292
## 3  0.23612853  0.39407628  0.03471458  0.48318495  0.66327838  0.3759725120
## 4  2.55623737  0.53852195  0.36134138  0.62256686  0.27101815  1.2306917174
## 5 -0.20498730 -0.14348036 -0.02918252 -0.18625656 -0.22865236 -0.1865419798
##    hollister abercrombie          die       death        drunk       drugs
## 1  4.1521844  3.96493810  0.043475966  0.09857501  0.035614771  0.03443294
## 2 -0.1678300 -0.14129577  0.009447317  0.05135888 -0.086773220 -0.06878491
## 3 -0.0553846 -0.07417839  0.037989066  0.11972190 -0.009688746 -0.05973769
## 4  0.1610784  0.26324494  1.712181870  0.93631312  1.897388200  2.73326605
## 5 -0.1557662 -0.14861104 -0.094875180 -0.08370729 -0.087520105 -0.11423381
By examining whether the clusters fall above or below the mean level for each interest category, we can begin to notice patterns that distinguish the clusters from each other.
In practice, this involves printing the cluster centers and searching through them for any patterns or extreme values
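Scanning a large table by eye is tedious, so part of the search can be done in code. The sketch below is one possible approach, not the method used in the course: it takes four of the interest columns from the printed centers (rounded to three decimals and stored in a matrix with the hypothetical name centers) and reports, for each cluster, the interest with the highest standardized mean.

```r
# A few cluster centers copied from the printed output (rounded to 3 decimals)
centers <- matrix(
  c(-0.103, -0.109,  4.152,  0.034,
     4.067,  5.258, -0.168, -0.069,
    -0.021, -0.109, -0.055, -0.060,
     0.381, -0.020,  0.161,  2.733,
    -0.122, -0.111, -0.156, -0.114),
  nrow = 5, byrow = TRUE,
  dimnames = list(as.character(1:5), c("band", "marching", "hollister", "drugs"))
)

# For each cluster (row), find the interest with the largest standardized mean
top_interest <- colnames(centers)[apply(centers, 1, which.max)]
top_interest
```

With the real results, the same idea can be applied to teen_clusters$centers. Note that a cluster whose centers are all below zero, such as cluster 5, still gets a "top" interest, so the values themselves should be checked as well.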
Given this subset of the interest data, we can already infer some characteristics of the clusters. In the following table, each cluster is shown with the features that most distinguish it from the other clusters.
First, save cluster IDs back to the original data
# apply the cluster IDs to the original data frame
teens$cluster <- teen_clusters$cluster
Using the aggregate() function, we can relate the clusters to other characteristics of the teens that were not used to build the clusters.
# mean number of friends by cluster
aggregate(data = teens, friends ~ cluster, mean)
##   cluster  friends
## 1       1 41.43054
## 2       2 32.57333
## 3       3 37.16185
## 4       4 30.50290
## 5       5 27.70052
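To make the formula interface of aggregate() easier to follow, here is a minimal sketch on made-up data (the data frame toy and its values are hypothetical). The formula friends ~ cluster reads as "summarize friends, grouped by cluster".

```r
# Hypothetical toy data: a cluster ID and a friend count for five people
toy <- data.frame(
  cluster = c(1, 1, 2, 2, 2),
  friends = c(10, 20, 30, 40, 50)
)

# Formula interface: summarize friends, grouped by cluster, using the mean
friend_means <- aggregate(friends ~ cluster, data = toy, FUN = mean)
friend_means
```

Swapping mean for another summary function, such as median, changes the statistic without changing the grouping.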