Predictive Data

Already Completed Code

As we move from week to week and task to task, the code that you have already completed will stay in the template but will not run. This is done by adding eval=FALSE to the corresponding code chunk. Note that the libraries still need to be loaded in this program as well.

# Install and load necessary libraries
#install.packages("ggplot2") # Install ggplot2 for plotting; if you have already installed the packages, comment this out by entering a # in front of this command
#install.packages("scales")  # Install scales for formatting
#install.packages("moments") # Install moments for skewness and kurtosis
library(ggplot2)            # Load ggplot2 library
library(scales)             # Load scales library
library(moments)            # Load moments library

Setting up your directory in your computer.

Set your working directory here before you run the rest of the code.

# Check the current working directory
getwd()

## [1] "C:/Users/cdaniels/OneDrive - National University/Documents"

# In the next line, change the directory to the place where you saved the
# data file. If you prefer, you can save your data.csv file in the directory
# that getwd() reported above.
# For example, your next line should look something like this: setwd("C:/Users/tsapara/Documents")

# Set the working directory to where the data file is located
# This ensures the program can access the file correctly

setwd("C:/Users/cdaniels/OneDrive - National University/Documents")

### Choose an already existing directory in your computer.

Setting up your personalized data

# Read the CSV file
# header = TRUE tells R that the first row contains the column names
# sep defines the delimiter (comma in this case)
# stringsAsFactors = TRUE converts text columns to factors automatically
df <- read.csv("data.csv", header = TRUE, sep = ",", stringsAsFactors = TRUE)

##########################################################
# Define variables A and B based on your student ID
# A represents the first 3 digits, B represents the last 3 digits
A <- 34
B <- 99
Randomizer <- A + B # Randomizer ensures a consistent seed value for reproducibility


# Generate a random sample of 500 rows from the dataset
set.seed(Randomizer) # Set the seed for reproducibility
sample_size <- 500
df <- df[sample(nrow(df), sample_size, replace = TRUE), ] # Sample the dataset

write.csv(df, file = "my_data.csv", row.names = FALSE) # This command may take some time to run. Once it is done, it will create the data file locally in your directory.
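
To see why the seed matters, here is a minimal sketch on made-up numbers rather than the course data. It draws the same kind of sample twice with the same seed. The value 133 simply mirrors A + B from the chunk above, and the names seed_value, first_draw, and second_draw are illustrative only.

```r
# Illustrative only: 1000 stands in for nrow(df) and 5 for sample_size
seed_value <- 34 + 99

set.seed(seed_value)
first_draw <- sample(1000, 5, replace = TRUE)

set.seed(seed_value)
second_draw <- sample(1000, 5, replace = TRUE)

# Same seed and same call: both draws contain the same row numbers
identical(first_draw, second_draw)

# Because replace = TRUE, a row can be picked more than once,
# so the sampled data may contain duplicated rows
any(duplicated(first_draw))
```

This is also why repeated ID values in your sampled file are not necessarily an error: sampling with replacement allows the same row to appear several times.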

Knit your file

As practice, you may now want to knit your file to HTML. To do this, click the Knit button on the top panel and wait for the file to render. The HTML file will open once rendering is done so that you can review it.

It is recommended that you practice with RMD and download and review the following cheatsheets: https://rmarkdown.rstudio.com/lesson-15.html

In addition, you may want to alter some of the editor components and re-knit your file to gain some knowledge and understanding of RMD. For a complete tutorial, visit: https://rmarkdown.rstudio.com/lesson-2.html

df <- read.csv("my_data.csv", header = TRUE, sep = ",", stringsAsFactors = TRUE)

Cleaning up your data

Step 0. Now that you have read the file, you want to learn some information about your data.

The following commands will not be explained here. Do your research, review your CSV file, and answer the questions related to this part of your code.

# Basic exploratory commands
nrow(df)       # Number of rows in the dataset

## [1] 500

length(df)     # Number of columns (or variables) in the dataset

## [1] 17

str(df)        # Structure of the dataset (data types and a preview)

## 'data.frame':    500 obs. of  17 variables:
##  $ ID           : int  1066 1059 1046 971 1097 993 1019 18 1075 953 ...
##  $ Gender       : Factor w/ 3 levels "Female","Male",..: 1 1 1 1 3 2 1 2 2 2 ...
##  $ Age          : int  27 26 27 27 28 30 26 28 29 32 ...
##  $ Height       : int  168 160 168 168 185 182 160 178 175 178 ...
##  $ Weight       : int  65 55 65 65 85 80 55 78 70 75 ...
##  $ Education    : Factor w/ 5 levels "","Bachelor's",..: 2 2 2 2 3 4 2 3 5 4 ...
##  $ Income       : int  45000 48000 45000 45000 NA 65000 48000 38000 60000 NA ...
##  $ MaritalStatus: Factor w/ 3 levels "","Married","Single": 3 3 3 3 3 2 3 2 2 2 ...
##  $ Employment   : Factor w/ 3 levels "","Employed",..: 2 2 2 2 3 2 2 2 2 2 ...
##  $ Score        : num  6.2 6.1 6.2 6.2 5.7 NA 6.1 6.7 7.8 7.5 ...
##  $ Category     : Factor w/ 4 levels "","A","B","C": 3 3 3 3 2 4 3 2 4 2 ...
##  $ Color        : Factor w/ 3 levels "Art","Music",..: 1 1 1 1 3 2 1 3 3 3 ...
##  $ Hobby        : Factor w/ 3 levels "Blue","Green",..: 1 1 1 1 2 3 1 3 3 1 ...
##  $ Happiness    : Factor w/ 5 levels "","Photography",..: 3 4 3 3 2 5 4 2 5 3 ...
##  $ Location     : num  7 7 7 7 6 8.5 7 8 9 8 ...
##  $ X            : Factor w/ 4 levels "","City","Rural",..: 2 3 2 2 3 2 3 3 1 2 ...
##  $ X.1          : logi  NA NA NA NA NA NA ...

summary(df)    # Summary statistics for each column

##        ID            Gender         Age            Height          Weight     
##  Min.   :   1.0   Female:183   Min.   :25.00   Min.   :155.0   Min.   :54.00  
##  1st Qu.:  44.5   Male  :227   1st Qu.:27.00   1st Qu.:165.0   1st Qu.:65.00  
##  Median :1000.0   Other : 90   Median :28.00   Median :175.0   Median :70.00  
##  Mean   : 692.5                Mean   :28.55   Mean   :173.6   Mean   :70.97  
##  3rd Qu.:1060.0                3rd Qu.:30.00   3rd Qu.:182.0   3rd Qu.:80.00  
##  Max.   :1117.0                Max.   :34.00   Max.   :190.0   Max.   :90.00  
##                                NA's   :22      NA's   :19      NA's   :44     
##        Education       Income      MaritalStatus      Employment 
##             : 18   Min.   :32000          : 25             : 15  
##  Bachelor's :180   1st Qu.:45000   Married:205   Employed  :382  
##  High School: 98   Median :48000   Single :270   Unemployed:103  
##  Master's   :124   Mean   :51616                                 
##  PhD        : 78   3rd Qu.:60000                                 
##  NA's       :  2   Max.   :70000                                 
##                    NA's   :76                                    
##      Score       Category    Color       Hobby           Happiness  
##  Min.   :5.500    :  6    Art   :172   Blue :207              :  6  
##  1st Qu.:6.100   A:163    Music :144   Green:139   Photography: 90  
##  Median :6.200   B:190    Sports:184   Red  :154   Reading    :151  
##  Mean   :6.726   C:141                             Swimming   : 89  
##  3rd Qu.:7.500                                     Traveling  :164  
##  Max.   :8.900                                                      
##  NA's   :70                                                         
##     Location          X         X.1         
##  Min.   :6.000         :  4   Mode:logical  
##  1st Qu.:7.000   City  :220   NA's:500      
##  Median :7.250   Rural :184                 
##  Mean   :7.532   Suburb: 92                 
##  3rd Qu.:8.500                              
##  Max.   :9.000                              
##  NA's   :22

Your Turn

Please answer the following questions by typing your answer after each question.

Question 1

What type of variables does your file include?

Answer 1:

Question 2

Specific data types?

Answer 2:

Question 3

Are they read properly?

Answer 3:

Question 4

Are there any issues?

Answer 4:

Question 5

Does your file include both NAs and blanks?

Answer 5:

Question 6

How many NAs do you have?

Answer 6:

Question 7

How many blanks?

Answer 7:

Cleanup Continued

Step 1: Handling both blanks and NAs is not simple, so first we want to eliminate one of the two. Let’s eliminate the blanks by changing them to NAs.

# Step 1: Replace blanks with NAs
df[df == ""] <- NA

# Define possible categorical columns
factor_columns <- c("Gender", "Education", "Rating", "MaritalStatus", 
                    "Category", "Employment", "Color", "Hobby", "Location")

# Keep only those that actually exist in df
factor_columns <- factor_columns[factor_columns %in% names(df)]

# Convert them to factors (only if any exist)
if (length(factor_columns) > 0) {
  df[factor_columns] <- lapply(df[factor_columns], function(col) as.factor(as.character(col)))
}

# Debug: print which factor columns were found
cat("Factor columns in df:\n")

## Factor columns in df:

print(factor_columns)

## [1] "Gender"        "Education"     "MaritalStatus" "Category"     
## [5] "Employment"    "Color"         "Hobby"         "Location"
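
The difference between a blank and an NA can be seen on a small made-up data frame. The sketch below is not the course data; toy and the counter names are illustrative, and droplevels() is used here as one possible way to remove the unused "" level (the chunk above reaches the same result with as.factor(as.character(col))).

```r
# Made-up data: one blank ("") and one true NA in Gender, one NA in Age
toy <- data.frame(
  Gender = c("Female", "", "Male", NA),
  Age = c(27, NA, 30, 28),
  stringsAsFactors = TRUE
)

# Blanks and NAs are counted separately before the cleanup
blanks_before <- colSums(toy == "", na.rm = TRUE)
nas_before <- colSums(is.na(toy))

# Same replacement as in Step 1: every blank becomes NA
toy[toy == ""] <- NA
toy$Gender <- droplevels(toy$Gender) # remove the now-unused "" level

nas_after <- colSums(is.na(toy))

blanks_before
nas_before
nas_after
```

After the replacement the blank is counted together with the original NA, which is why the total number of NAs grows after Step 1.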

Cleanup Continued

Step 2: Count NAs in the entire dataset

#
# Step 2: Count NAs in the entire dataset


# Count the total number of NAs in the dataset
total_nas <- sum(is.na(df))
total_nas # Print the total number of missing values

## [1] 829

Your Turn

Please answer the following questions by typing your answer after each question.

Question 8

Explain what the printed number is, what information it conveys, and how you can use it in your analysis.

Answer 8:

Clean Up Continued

Step 3: Count rows with NAs.

#
# Step 3: Count rows with NAs
#

# Count rows with at least one NA
rows_with_nas <- sum(rowSums(is.na(df)) > 0)
Percent_row_NA <- percent(rows_with_nas / nrow(df)) # Percentage of rows with NAs
rows_with_nas

## [1] 500

Percent_row_NA

## [1] "100%"
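
A result of 100% usually deserves a second look. In the summary above, X.1 is NA in all 500 rows, so every row contains at least one NA no matter what the other columns hold. The sketch below, on a made-up three-row data frame, shows one possible way to detect entirely empty columns and recount without them. The names toy, all_na_cols, and toy_trimmed are illustrative, and whether to drop such a column in your own analysis remains your decision.

```r
# Made-up data: column "empty" is NA in every row
toy <- data.frame(
  a = c(1, NA, 3),
  b = c("x", "y", "z"),
  empty = c(NA, NA, NA)
)

# Columns in which every single value is NA
all_na_cols <- names(toy)[colSums(is.na(toy)) == nrow(toy)]

# Rows with at least one NA, counted with and without the empty column
rows_with_nas_all <- sum(rowSums(is.na(toy)) > 0)
toy_trimmed <- toy[, !(names(toy) %in% all_na_cols), drop = FALSE]
rows_with_nas_trimmed <- sum(rowSums(is.na(toy_trimmed)) > 0)

all_na_cols
rows_with_nas_all
rows_with_nas_trimmed
```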

Your Turn

Question 9

How large is the proportion of rows with NAs? Keep in mind that we can drop up to 5% of the rows.

Answer 9:

Question 10

Do you think that it would be wise to drop the above percentage?

Answer 10:

Question 11

How will this affect your dataset?

Answer 11:

CleanUp Continued

Step 4: Count columns with NAs

#  
# Step 4: Count columns with NAs

# Count columns with at least one NA
cols_with_nas <- sum(colSums(is.na(df)) > 0)
Percent_col_NA <- percent(cols_with_nas / length(df)) # Percentage of columns with NAs
cols_with_nas

## [1] 13

Percent_col_NA

## [1] "76%"
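
A single percentage hides which columns are affected and how badly. As a rough sketch on made-up data, a per-column breakdown can be built from colSums(is.na(...)); the data frame toy and the report name na_report are illustrative only.

```r
# Made-up data with a different number of NAs in each column
toy <- data.frame(
  Age = c(27, NA, 30, 28),
  Income = c(45000, NA, NA, 60000),
  Gender = c("Female", "Male", "Male", "Other")
)

na_count <- colSums(is.na(toy))                 # NAs per column
na_share <- round(100 * na_count / nrow(toy), 1) # as a percentage of rows

na_report <- data.frame(
  Variable = names(na_count),
  NAs = as.vector(na_count),
  Percent = as.vector(na_share)
)

# Most affected columns first
na_report <- na_report[order(-na_report$NAs), ]
na_report
```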

Your Turn

Question 12

How large is the proportion of columns with NAs? We never want to drop entire columns, as this would mean losing variables and associations, but do you think it would be wise to drop the above percentage?

Answer 12:

Question 13

How will this affect your dataset?

Answer 13:

Imputation

Step 5: Replace NAs with appropriate values (mean for numeric and integer, mode for factor, “NA” for character)

In later weeks we will learn how to replace the NAs properly based on the descriptive statistics, and you will discuss this code. For now, you can assume that replacing NAs with the mean for numeric variables and the mode for categorical variables is correct. This is not always the case, of course, but the code would become much more complicated otherwise.

# 
# Step 5: Replace NAs with appropriate values (mean for numeric and integer,
# mode for factor, "NA" for character)
# In later weeks we will learn how to replace the NAs properly based on the
# descriptive statistics and you will discuss this code.
# for now, you can assume that by setting the mean of the variable for numeric
# and mode for categorical it is correct - this is not always the case of course
# but the code will become much more complicated in that case.


# Replace NAs with appropriate values
# Numeric: Replace with the mean if sufficient data is available
# Categorical: Replace with the mode (most common value)
# Character: Replace with the string "NA"
df <- lapply(df, function(col) {
  if (is.numeric(col) || is.integer(col)) { # Numeric or integer columns
    if (sum(!is.na(col)) > 10) {
      col[is.na(col)] <- mean(col, na.rm = TRUE) # Replace with mean
    } else {
      col[is.na(col)] <- approx(seq_along(col), col, n = length(col))[["y"]][is.na(col)] # Interpolation
    }
  } else if (is.factor(col)) { # Factor columns
    mode_val <- names(sort(-table(col)))[1] # Mode (most common value)
    col[is.na(col)] <- mode_val
  } else if (is.character(col)) { # Character columns
    col[is.na(col)] <- "NA" # Replace with "NA"
  }
  return(col) # Return the modified column
})

df <- as.data.frame(df) # Convert the list back to a dataframe


#
# Imputing with the above method has now changed some of the statistics


# Check the updated dataset and ensure no remaining NAs
summary(df)

##        ID            Gender         Age            Height          Weight     
##  Min.   :   1.0   Female:183   Min.   :25.00   Min.   :155.0   Min.   :54.00  
##  1st Qu.:  44.5   Male  :227   1st Qu.:27.00   1st Qu.:165.0   1st Qu.:65.00  
##  Median :1000.0   Other : 90   Median :28.00   Median :175.0   Median :70.97  
##  Mean   : 692.5                Mean   :28.55   Mean   :173.6   Mean   :70.97  
##  3rd Qu.:1060.0                3rd Qu.:30.00   3rd Qu.:182.0   3rd Qu.:80.00  
##  Max.   :1117.0                Max.   :34.00   Max.   :190.0   Max.   :90.00  
##                                                                               
##        Education       Income      MaritalStatus      Employment 
##  Bachelor's :200   Min.   :32000   Married:205   Employed  :397  
##  High School: 98   1st Qu.:45000   Single :295   Unemployed:103  
##  Master's   :124   Median :51616                                 
##  PhD        : 78   Mean   :51616                                 
##                    3rd Qu.:60000                                 
##                    Max.   :70000                                 
##                                                                  
##      Score       Category    Color       Hobby           Happiness  
##  Min.   :5.500   A:163    Art   :172   Blue :207              :  0  
##  1st Qu.:6.100   B:196    Music :144   Green:139   Photography: 90  
##  Median :6.700   C:141    Sports:184   Red  :154   Reading    :151  
##  Mean   :6.726                                     Swimming   : 89  
##  3rd Qu.:7.500                                     Traveling  :170  
##  Max.   :8.900                                                      
##                                                                     
##     Location        X         X.1         
##  7      :178         :  0   Mode:logical  
##  8      : 88   City  :224   NA's:500      
##  9      : 69   Rural :184                 
##  6      : 68   Suburb: 92                 
##  8.5    : 60                              
##  7.5    : 18                              
##  (Other): 19
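
When you compare the two summaries, note for example that the median of Weight moved from 70.00 to 70.97 and the median of Income from 48000 to 51616, while the means stayed the same. The following minimal sketch on made-up values shows the mechanism; it is an illustration of mean and mode imputation in general, not a replacement for the chunk above.

```r
# Numeric example: two of six values are missing
x <- c(10, 12, NA, 14, NA, 20)
x_imputed <- x
x_imputed[is.na(x_imputed)] <- mean(x, na.rm = TRUE)

mean_before <- mean(x, na.rm = TRUE)
mean_after <- mean(x_imputed)        # unchanged by mean imputation
sd_before <- sd(x, na.rm = TRUE)
sd_after <- sd(x_imputed)            # smaller: imputed values sit exactly at the mean

# Factor example: NAs are replaced with the most common level
f <- factor(c("A", "B", "B", NA, "C", NA))
mode_val <- names(sort(-table(f)))[1]
f_imputed <- f
f_imputed[is.na(f_imputed)] <- mode_val

c(mean_before = mean_before, mean_after = mean_after)
c(sd_before = sd_before, sd_after = sd_after)
table(f_imputed)
```

The mean is preserved, the spread shrinks, and the most common category becomes even more common. These are the kinds of changes to look for in your own summary.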

Your Turn

Essay Question

Run summary(df) and compare with the previous statistics. Do you observe any undesired changes? Explain in detail. Are there any more NAs in your file? What information is printed by the summary? How can it be interpreted? What are your observations? Verify the effects of imputation, and explain in detail. Compare the updated summary with the earlier statistics and note the changes. Explain everything that you observe.

Answer

Descriptive Statistics

Step 6: Create descriptive statistics for all variables

We run all the descriptive statistics for all the numeric variables

################################################################### 
# 
# Step 6: Create descriptive statistics for all variables
# We run all the descriptive statistics for all the numeric variables
#
###################################################################
# Initialize a function to compute descriptive statistics
compute_stats <- function(column, name) {
  if (is.numeric(column) || is.integer(column)) {
    data.frame(
      Variable = name,
      Mean = round(mean(column, na.rm = TRUE), 2),
      Median = round(median(column, na.rm = TRUE), 2),
      St.Deviation = round(sd(column, na.rm = TRUE), 2),
      Range = round(diff(range(column, na.rm = TRUE)), 2),
      IQR = round(IQR(column, na.rm = TRUE), 2),
      Skewness = round(skewness(column, na.rm = TRUE), 2),
      Kurtosis = round(kurtosis(column, na.rm = TRUE), 2),
      stringsAsFactors = FALSE
    )
  } else {
    NULL
  }
}

# Apply the function to each numeric or integer column in the dataset
descriptive_stats <- do.call(
  rbind,
  lapply(names(df), function(col) compute_stats(df[[col]], col))
)

# Print the descriptive statistics dataframe
descriptive_stats

##   Variable     Mean   Median St.Deviation   Range     IQR Skewness Kurtosis
## 1       ID   692.48  1000.00       479.97  1116.0  1015.5    -0.64     1.44
## 2      Age    28.55    28.00         2.08     9.0     3.0     0.56     3.06
## 3   Height   173.56   175.00         8.94    35.0    17.0    -0.16     1.91
## 4   Weight    70.97    70.97         9.87    36.0    15.0    -0.04     2.08
## 5   Income 51615.57 51615.57      8453.85 38000.0 15000.0     0.14     2.53
## 6    Score     6.73     6.70         0.83     3.4     1.4     0.60     2.53

Descriptive Statistics Continued

Step 7: Print Descriptive Statistics

Now you have all the descriptive statistics for all numeric variables. Create a professional table in your paper. The kableExtra package, loaded with library(kableExtra), can help you create the table here. If you have no programming experience, you can copy and paste the output into Excel and format the table there.

#############################################################
# 
# Step 7: Print Descriptive Statistics
# Now you have all the descriptive statistics for all numeric variables
# Create a professional table in your paper.
# The kableExtra package, loaded with library(kableExtra), can help you create the table here.
# If you have no programming experience, you can copy and paste into Excel
# and format the table in Excel
#############################################################
  
  print("Descriptive Statistics:")

## [1] "Descriptive Statistics:"

  print(descriptive_stats)

##   Variable     Mean   Median St.Deviation   Range     IQR Skewness Kurtosis
## 1       ID   692.48  1000.00       479.97  1116.0  1015.5    -0.64     1.44
## 2      Age    28.55    28.00         2.08     9.0     3.0     0.56     3.06
## 3   Height   173.56   175.00         8.94    35.0    17.0    -0.16     1.91
## 4   Weight    70.97    70.97         9.87    36.0    15.0    -0.04     2.08
## 5   Income 51615.57 51615.57      8453.85 38000.0 15000.0     0.14     2.53
## 6    Score     6.73     6.70         0.83     3.4     1.4     0.60     2.53

Your Turn

Essay Question

Review and compare with the previous statistics. Do you observe any undesired changes? Explain in detail. How can this be interpreted? What are your observations? Verify the descriptive statistics, and explain in detail. Explain everything that you observe. Complete your research, compare your variables, and complete your paper.

Answer

Visual Representations

Step 8: Create graphs using ggplot2

In this part, there are pieces of code that you will need to change to create your graphs. The example is set to work with Happiness. Make the necessary changes to create the rest of the graphs. You may also want to change the colors, the dimensions, etc.

  #######################################################################
  # 
  # Step 8: Create graphs using ggplot2
  # There are parts of this code that you will need to change to create 
  # your graphs.
  # The example is set to work with Happiness
  # Make the necessary changes to create the rest of the graphs
  # You may also want to change the colors, the dimensions, etc.
  #############################################################
  
  #############################################################
  #
  # STEP 8a: Create a bar graph or a histogram
  # Explain which graph was created and why.
  # Set col to the desired column name
  #############################################################
  #
  ##
  # In this code we start you off with an example of Happiness; later in the code
  # you should replace this with your desired variable.
  #
  
    col <- "Happiness"  # This is an example; try to do the same with a different variable

Bar graph - if the variable of your choice is categorical

  # Assume df is your dataframe and col is the column name (as string)
if (is.factor(df[[col]])) {
  # Bar graph for factors
  ggplot(df, aes(x = .data[[col]], fill = .data[[col]])) +
    geom_bar() +
    labs(title = paste("Bar Graph for", col), x = col, y = "Count") +
    theme_minimal() +
    theme(legend.position = "right")
  
}

Histogram - if the variable of your choice is numerical

You can also copy the chunk and create more graphs by resetting the col variable appropriately.

if (is.numeric(df[[col]]) || is.integer(df[[col]])) {
  # Histogram for numeric variables
  ggplot(df, aes(x = .data[[col]])) +
    geom_histogram(bins = 30, fill = "steelblue", color = "black") +
    labs(title = paste("Histogram for", col), x = col, y = "Frequency") +
    theme_minimal()
}

Your Turn

Essay Question

Now that you can observe your data graphically, explain the importance of graphical representations and how they help to communicate data to other parties. Explain which graph was created and why.

Answer

Your Turn

STEP 8b: Create a boxplot and a histogram for numeric variables. Note that the bin width cannot be set up in the same way for Age or Happiness, which have a small range, and for Income, whose range is in the thousands. Change this appropriately.

Please note that this part of the code will not run for the demo code. You will need to change the value of eval=FALSE to eval=TRUE, after you introduce your code, to run it and add it to your knitted file.

#############################################################
#
# STEP 8b: Create a boxplot and histogram for numeric variables
# Note that the bin width cannot be set up in the same way to work with
# Age or Happiness, which have a small range, and Income, whose range is in the thousands
# Change this appropriately
#############################################################
#
# Choose a numeric variable (e.g., Age), set the col variable to the name of
# the column, and then rerun the code that is commented out here.

#col <- "Age" # Add the variable of your choice (the column name must be in quotes)

# Uncomment the code and you will create a bar graph or a histogram of a different variable here.
# Do not forget to change eval=FALSE to eval=TRUE to run and knit this chunk.
#
# If col is categorical, the code creates a bar graph.
# If col is numeric, it creates a histogram.
# Highlight and run up to the line that starts with `# Boxplot for numeric variables`.

#  if (is.factor(df[[col]])) {
#      # Bar graph for factors
#      ggplot(df, aes(x = .data[[col]], fill = .data[[col]])) +
#        geom_bar() +
#        labs(title = paste("Bar Graph for", col), x = col, y = "Count") +
#        theme_minimal() +
#        theme(legend.position = "right")
#    } else if (is.numeric(df[[col]]) || is.integer(df[[col]])) {
#      # Histogram for numeric variables
#      ggplot(df, aes(x = .data[[col]])) +
#        geom_histogram(binwidth = 0.3) +
#        labs(title = paste("Histogram for", col), x = col, y = "Count") +
#        theme_minimal()
#    }
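
Because a bin width of 0.3 suits a variable like Score but not Income, one option is to derive the width from the data. The sketch below uses the Freedman-Diaconis rule (2 * IQR / n^(1/3)), which is a common rule of thumb and not part of the original template. The helper name suggest_binwidth and the two made-up vectors are hypothetical.

```r
# Hypothetical helper: Freedman-Diaconis bin width for a numeric vector
suggest_binwidth <- function(x) {
  x <- x[!is.na(x)]
  2 * IQR(x) / length(x)^(1 / 3)
}

# Made-up values on two very different scales
age_like <- c(25, 26, 27, 27, 28, 28, 29, 30, 32, 34)
income_like <- c(32000, 38000, 45000, 45000, 48000,
                 48000, 55000, 60000, 65000, 70000)

bw_age <- suggest_binwidth(age_like)
bw_income <- suggest_binwidth(income_like)

bw_age
bw_income

# Possible use inside the histogram code (assumes ggplot2 is loaded and
# df and col are defined as in the template):
# ggplot(df, aes(x = .data[[col]])) +
#   geom_histogram(binwidth = suggest_binwidth(df[[col]]))
```

If a variable has an IQR of 0, this rule returns 0 and cannot be used; in that case set the bin width by hand or use the bins argument instead.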

Your Turn

Essay Question

Now explain this graph. Focus on the information extracted, anomalies, outliers, relationships.

Answer

Your Turn

Step 8c: Note that you should run this part with the latest value of col. Do not forget to change eval=FALSE to eval=TRUE to knit it.

Boxplot for numeric variables

#############################################################
#
# Step 8c
# NOTE that you should run this part of the code after you
# copy the graph that the previous code creates. Boxplot for numeric variables
#############################################################
# The following code will run only if col is numeric; otherwise it will give you an error.

ggplot(df, aes(x = "", y = .data[[col]])) +
  geom_boxplot(fill = "skyblue", color = "darkblue", width = 0.3, outlier.color = "red", outlier.size = 2) +
  labs(
    title = paste("Box Plot for", col),
    x = NULL,
    y = "Value"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank(),
    plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
    axis.title.y = element_text(size = 14),
    axis.text.y = element_text(size = 12)
  )
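
The red points in a box plot are, by default, values that lie more than 1.5 times the IQR beyond the box. As a rough sketch on made-up values, the same fences can be computed by hand. Note that quantile() and the hinges used by the plot can differ slightly on small samples, so treat the numbers as an approximation of what the plot shows.

```r
# Made-up weights; 140 is deliberately extreme
x <- c(54, 60, 65, 65, 70, 70, 75, 80, 85, 140)

q1 <- quantile(x, 0.25, names = FALSE)
q3 <- quantile(x, 0.75, names = FALSE)
iqr <- q3 - q1

lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Values outside the fences are flagged as potential outliers
outliers <- x[x < lower_fence | x > upper_fence]

c(Q1 = q1, Q3 = q3, IQR = iqr)
c(lower = lower_fence, upper = upper_fence)
outliers
```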

Your Turn

Essay Question

Explain the findings of your Boxplot. Are there any outliers? What is the IQR? Focus on the information extracted, anomalies, outliers, relationships.

Answer

Tabular Representations

Step 9: Tables

Creating tables to understand how the different categorical variables interconnect. Tabular information can be provided in both tables and parallel barplots. The following is an example on two variables, choose two others to get more valuable insights.

#############################################################
#
# Step 9
#
# Creating tables to understand how the different categorical variables
# interconnect
# Tabular information can be provided in both tables and parallel barplots.
# The following is an example on two variables, choose two others to get
# more valuable insights.
#############################################################

Gender_Education <- table(df$Education, df$Gender)
Gender_Education # what does this information tell you?

##              
##               Female Male Other
##   Bachelor's     176   11    13
##   High School      3   27    68
##   Master's         4  120     0
##   PhD              0   69     9

# How many rows are there?
# This is the number of colors you should have in the vector below;
# more intuitive colors can be added here.
# Keep the order from top to bottom to create your legend vector
addmargins(Gender_Education) # Add totals to your table

##              
##               Female Male Other Sum
##   Bachelor's     176   11    13 200
##   High School      3   27    68  98
##   Master's         4  120     0 124
##   PhD              0   69     9  78
##   Sum            183  227    90 500

color <- c("red","blue","yellow","green")
names <- c("Bachelor's","High School", "Master's","PhD")
barplot(Gender_Education, col=color, beside= TRUE, main = "Education by Gender", ylim = c(0,250) )
legend("topright",names,fill=color,cex=0.5)

# topright is the position of the legend, it can be moved to top, left bottom, etc...
# you do not change the rest of the parameters here
print(addmargins(Gender_Education))

##              
##               Female Male Other Sum
##   Bachelor's     176   11    13 200
##   High School      3   27    68  98
##   Master's         4  120     0 124
##   PhD              0   69     9  78
##   Sum            183  227    90 500

Making Pretty Tables

library(knitr)
library(kableExtra)

# Create the contingency table
Gender_Education <- table(df$Education, df$Gender)

# Add row and column totals
Gender_Education_margins <- addmargins(Gender_Education)

# Make a clean and beautiful table with kable
kable(Gender_Education_margins, caption = "Gender by Education Level", align = 'c') %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover", "condensed")) %>%
  row_spec(0, bold = TRUE, background = "#D3D3D3")  # Highlight header

Gender by Education Level
               Female   Male   Other   Sum
Bachelor’s        176     11      13   200
High School         3     27      68    98
Master’s            4    120       0   124
PhD                 0     69       9    78
Sum               183    227      90   500
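
Raw counts are hard to compare when the groups have different sizes, so percentages often make the relationships easier to describe. The sketch below rebuilds the counts printed above as a matrix so that it runs on its own; with the real data you would pass Gender_Education to prop.table() directly. The names counts, row_percent, and col_percent are illustrative.

```r
# Counts copied from the table above (rows: Education, columns: Gender)
counts <- matrix(
  c(176,  11, 13,
      3,  27, 68,
      4, 120,  0,
      0,  69,  9),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    Education = c("Bachelor's", "High School", "Master's", "PhD"),
    Gender = c("Female", "Male", "Other")
  )
)

# Share of each gender within an education level (each row sums to 100)
row_percent <- round(100 * prop.table(counts, margin = 1), 1)

# Share of each education level within a gender (each column sums to 100)
col_percent <- round(100 * prop.table(counts, margin = 2), 1)

row_percent
col_percent
```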

Your Turn

Essay Question

Explain the table in detail. Focus on the information extracted, anomalies, outliers, relationships.

Answer

Predictive Modeling

Step 10 - Linear Regression and Correlation

Use the following chunk as a compass. Choose two numeric variables and run the following regression. Choose different variables than the ones presented below.

#############################################################
#
# Step 10 - Linear Regression and Scatterplots
# Choose two numeric variables and run the following regression.
# Do not use the following two variables
# The code is presented as an example
#
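# We separate the numerical variables and review their relationships
# The numerical variables are in columns 3, 4, 5, 7, 10, and 15
#############################################################

temp_df <- df[c(3:5,7,10,15)] # we select the columns listed above; note that column 15 (Location) was converted to a factor in Step 1, so pairs() plots its integer codes
pairs(temp_df) # this creates a scatterplot matrix (one scatterplot for each pair of variables)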
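
pairs() draws the relationships but does not print any numbers. If you also want numeric correlations to discuss, cor() is one way to obtain them. The sketch below uses a small made-up data frame so that it runs on its own; with the course data you would pass your own numeric columns instead. No heat map is produced by the code in this template, so this example stays with the numeric matrix.

```r
# Made-up values, for illustration only
toy <- data.frame(
  Age = c(25, 27, 28, 30, 32, 34),
  Height = c(160, 168, 175, 178, 182, 185),
  Weight = c(55, 65, 70, 75, 80, 85),
  Income = c(48000, 45000, 38000, 60000, 65000, 70000)
)

# Pearson correlations between every pair of columns;
# use = "complete.obs" skips rows that contain NAs
cor_matrix <- round(cor(toy, use = "complete.obs"), 2)
cor_matrix
```

Values close to 1 or -1 indicate a strong linear relationship, and values close to 0 indicate a weak one. The diagonal is always 1 because each variable is perfectly correlated with itself.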

Your Turn

Essay Question

Explain the correlation matrix and the heat map in detail. What relationships can you identify? What trends? Why are they important? Can you tie this to your beliefs and understanding of similar data?

Answer

#############################################################
#
# Step 11 Run a regression model
# Make the necessary changes below to run your own regression
# Answer the questions in your paper
#
#############################################################


r <- lm(Income~Age, data=df) # it runs the least squares
r

## 
## Call:
## lm(formula = Income ~ Age, data = df)
## 
## Coefficients:
## (Intercept)          Age  
##       -3348         1925

summary(r) # Information about your variables, R^2 and p value are printed

## 
## Call:
## lm(formula = Income ~ Age, data = df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -22411  -3635   1055   3384  18384 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -3348.3     4583.6   -0.73    0.465    
## Age           1925.3      160.1   12.02   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7450 on 498 degrees of freedom
## Multiple R-squared:  0.225,  Adjusted R-squared:  0.2234 
## F-statistic: 144.6 on 1 and 498 DF,  p-value: < 2.2e-16

# In your example you should change the title and the labels of the axis appropriately
# Change Colors
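
As a reading aid, the fitted line above is Income = -3348.3 + 1925.3 * Age, so each additional year of age is associated with about 1925 more in predicted income. The sketch below first applies those printed coefficients by hand and then shows, on made-up data, how R-squared and the p-value can be pulled out of a model summary. The data frame toy and the object names are illustrative.

```r
# 1) Prediction by hand with the coefficients printed above
intercept <- -3348.3
slope <- 1925.3
predicted_income_30 <- intercept + slope * 30
predicted_income_30

# 2) Extracting the key numbers from a fitted model (made-up data)
toy <- data.frame(
  Age = c(25, 27, 28, 30, 32, 34),
  Income = c(44000, 49000, 50000, 56000, 57000, 63000)
)
fit <- lm(Income ~ Age, data = toy)
fit_summary <- summary(fit)

r_squared <- fit_summary$r.squared                  # share of variance explained
p_value_age <- coef(fit_summary)["Age", "Pr(>|t|)"] # p-value of the slope

r_squared
p_value_age

# The same kind of prediction, done by R instead of by hand
predict(fit, newdata = data.frame(Age = 30))
```

In the course output, R-squared is 0.225, so Age alone accounts for roughly 22.5% of the variation in Income even though the slope is highly significant.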

Your Turn

Essay Question

Explain the results that you receive. What does the R^2 mean in this example? What about the p-value? Why are they important? Can you tie this to your beliefs and understanding of similar data?

Answer

Visual representations

#############################################################
#
# Step 12
#
# Create a scatterplot and add the regression line.
#############################################################

plot(df$Age,df$Income, col = "blue", main = "Income vs Age", xlab = "Age", ylab= "Income") # it plots the scatterplot
abline(reg=r, col = "red")          # it adds the regression line

Your Turn

Essay Question

How does the scatterplot provide more or a different perspective to the researcher? Please describe the plot and also share your insights from this exploration.

Answer

Your Turn

Complete the steps shown above for two different numerical variables.

#############################################################
#
# Step 11 Run a regression model
# Make the necessary changes below to run your own regression
# Answer the questions in your paper
#
#############################################################

# CHANGE the variables Income and Age, but always choose numerical variables.
# Do not forget to change the eval=TRUE for this and the following chunk before you knit.

r <- lm(Height~Weight, data=df) # it runs the least squares
r

## 
## Call:
## lm(formula = Height ~ Weight, data = df)
## 
## Coefficients:
## (Intercept)       Weight  
##    115.1916       0.8224

summary(r) # Information about your variables, R^2 and p value are printed

## 
## Call:
## lm(formula = Height ~ Weight, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.5613  -0.6501   0.6255   2.1069   6.4387 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 115.19156    1.21415   94.87   <2e-16 ***
## Weight        0.82244    0.01694   48.54   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.737 on 498 degrees of freedom
## Multiple R-squared:  0.8255, Adjusted R-squared:  0.8251 
## F-statistic:  2356 on 1 and 498 DF,  p-value: < 2.2e-16

# In your example you should change the title and the labels of the axis appropriately
# Change Colors

Your Regression Diagnostics

#############################################################
#
# Step 12
#
# Create a scatterplot and add the regression line.
#############################################################

plot(df$Weight, df$Height, col = "blue", main = "Height vs Weight", xlab = "Weight", ylab = "Height") # scatterplot with the predictor (Weight) on x and the response (Height) on y, matching r
abline(reg = r, col = "red")          # it adds the regression line
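abline() draws the fitted line in the coordinates of the model, so the line only sits on the points when the predictor is on the horizontal axis and the response is on the vertical axis. The sketch below illustrates this and, as an optional extra, adds a 95% confidence band with predict(). It assumes a hypothetical simulated data set `toy` rather than `df`; the names `fit`, `grid` and `band` are illustrative only.

```r
# Hypothetical stand-in data: 'toy' is simulated and is NOT the course data set.
set.seed(2)
toy <- data.frame(Weight = runif(80, 50, 100))
toy$Height <- 115 + 0.8 * toy$Weight + rnorm(80, sd = 4)

fit <- lm(Height ~ Weight, data = toy)

# Predictor on x, response on y, in the same order as the formula Height ~ Weight
plot(toy$Weight, toy$Height, col = "blue",
     main = "Height vs Weight", xlab = "Weight", ylab = "Height")
abline(fit, col = "red")

# Confidence band for the mean response over an evenly spaced grid of Weight values
grid <- data.frame(Weight = seq(min(toy$Weight), max(toy$Weight), length.out = 50))
band <- predict(fit, newdata = grid, interval = "confidence", level = 0.95)
lines(grid$Weight, band[, "lwr"], col = "red", lty = 2)
lines(grid$Weight, band[, "upr"], col = "red", lty = 2)
```

The band is narrowest near the mean of Weight and widens toward the edges, which is one way the plot shows where the fitted line is least certain.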

Your Turn

Essay Question

Explain the results that you received. What does the R^2 mean in this example? What about the p-value? Why are they important? Can you tie this to your beliefs and understanding of similar data? How does the scatterplot provide a different or additional perspective to the researcher? Describe the plot and share the insights you gained from this exploration.

Answer

Optional - Run and Explain a Multiple Linear Regression

An example with three predictors is shown below. You can choose your own variables or explain the findings for this example.

#############################################################
#
# Step 13 - Optional
#
# In case you want to add more variables to your model, the following
# example is provided.
#############################################################

r2 <- lm(Income~Age+Gender+Education, data=df) # it runs the least squares
r2

## 
## Call:
## lm(formula = Income ~ Age + Gender + Education, data = df)
## 
## Coefficients:
##          (Intercept)                   Age            GenderMale  
##              61979.6                -585.5                1319.4  
##          GenderOther  EducationHigh School     EducationMaster's  
##               4208.4               -3484.7               14060.2  
##         EducationPhD  
##              14049.9

summary(r2) # Information about your variables, R^2 and p value are printed

## 
## Call:
## lm(formula = Income ~ Age + Gender + Education, data = df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -13137  -2341   -368   5207   9367 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           61979.6     5177.8  11.970  < 2e-16 ***
## Age                    -585.5      194.0  -3.019 0.002668 ** 
## GenderMale             1319.4     1133.7   1.164 0.245060    
## GenderOther            4208.4     1104.5   3.810 0.000156 ***
## EducationHigh School  -3484.7     1039.0  -3.354 0.000858 ***
## EducationMaster's     14060.2     1154.5  12.178  < 2e-16 ***
## EducationPhD          14049.9     1125.5  12.484  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5043 on 493 degrees of freedom
## Multiple R-squared:  0.6484, Adjusted R-squared:  0.6441 
## F-statistic: 151.5 on 6 and 493 DF,  p-value: < 2.2e-16

# In your example you should change the title and the labels of the axis appropriately
# Change Colors
# Explain the outcome.
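In the output above, each categorical predictor has one level that does not appear in the coefficient table. R uses that level as the baseline, and every other coefficient for the factor is a difference from it. The sketch below shows this on a hypothetical simulated data set `toy` (not `df`), with made-up effect sizes in `bump`; it also shows how relevel() changes the baseline and how predict() scores a new observation.

```r
# Hypothetical stand-in data: 'toy' and the effect sizes in 'bump' are invented.
set.seed(3)
n <- 200
toy <- data.frame(
  Age       = round(runif(n, 20, 60)),
  Education = factor(sample(c("Bachelor's", "High School", "Master's"),
                            n, replace = TRUE))
)
bump <- c("Bachelor's" = 0, "High School" = -3000, "Master's" = 14000)
toy$Income <- 50000 + 200 * toy$Age +
  unname(bump[as.character(toy$Education)]) + rnorm(n, sd = 2000)

fit <- lm(Income ~ Age + Education, data = toy)
levels(toy$Education)[1]   # baseline level: the one with no coefficient of its own
coef(fit)                  # Education coefficients are differences from the baseline

# Change the baseline to "High School" and refit: same model, different contrasts
toy$Education <- relevel(toy$Education, ref = "High School")
fit_hs <- lm(Income ~ Age + Education, data = toy)
coef(fit_hs)

# Predicted income for a hypothetical 40-year-old with a Master's degree
new_person <- data.frame(
  Age       = 40,
  Education = factor("Master's", levels = levels(toy$Education))
)
predict(fit_hs, newdata = new_person)
```

Changing the baseline does not change the fitted values or R^2. Only the meaning of each coefficient changes, so the interpretation in the paper should name the baseline level explicitly.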

# Set up plotting space: 2 rows, 2 columns
par(mfrow = c(2, 2))

# 1. Residuals vs Fitted plot
# Checks for non-linearity, unequal error variance
plot(r2, which = 1)

# 2. Normal Q-Q plot
# Checks if residuals are normally distributed
plot(r2, which = 2)

# 3. Scale-Location plot
# Checks homoscedasticity (constant variance)
plot(r2, which = 3)

# 4. Residuals vs Leverage plot
# Finds influential observations
plot(r2, which = 5)

# Reset plotting space to normal
par(mfrow = c(1,1))

# --- Additional Useful Diagnostics ---

# 5. Histogram of residuals
hist(r2$residuals,
     main = "Histogram of Residuals",
     xlab = "Residuals",
     col = "lightblue",
     border = "white")

# 6. Residuals vs each predictor
# Helps to spot patterns the model missed, one predictor at a time
# (Gender and Education are factors, so plot() draws boxplots for them)
par(mfrow = c(1, 3))
plot(df$Age, r2$residuals, main = "Residuals vs Age", xlab = "Age", ylab = "Residuals")
abline(h = 0, col = "red")
plot(df$Gender, r2$residuals, main = "Residuals vs Gender", xlab = "Gender", ylab = "Residuals")
abline(h = 0, col = "red")
plot(df$Education, r2$residuals, main = "Residuals vs Education", xlab = "Education", ylab = "Residuals")
abline(h = 0, col = "red")

par(mfrow = c(1, 1))

# 7. Cook's Distance
# Identifies influential points
cooksd <- cooks.distance(r2)
plot(cooksd, type = "h", main = "Cook's Distance", ylab = "Cook's Distance")
abline(h = 4/length(cooksd), col = "red", lty = 2)  # common threshold
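The Cook's distance plot shows which bars cross the 4/n line, but it is often useful to list those observations and check how much the coefficients move without them. The following is a rough sketch on a hypothetical simulated data set `toy` with one planted outlier; `fit`, `flagged` and `fit_trim` are illustrative names, and 4/n is a rule of thumb rather than a formal test.

```r
# Hypothetical stand-in data: 'toy' is simulated and is NOT the course data set.
set.seed(4)
n <- 100
toy <- data.frame(x = rnorm(n))
toy$y <- 2 + 3 * toy$x + rnorm(n)
toy$y[n] <- toy$y[n] + 15          # plant one outlier so that something is flagged

fit <- lm(y ~ x, data = toy)

cooksd    <- cooks.distance(fit)
threshold <- 4 / length(cooksd)    # same rule-of-thumb cutoff as the dashed line
flagged   <- which(cooksd > threshold)

toy[flagged, ]                     # inspect the flagged rows

# Refit without the flagged rows and compare coefficients side by side.
# setdiff() keeps this safe even when nothing is flagged.
keep     <- setdiff(seq_len(n), flagged)
fit_trim <- lm(y ~ x, data = toy[keep, ])
rbind(all_rows = coef(fit), trimmed = coef(fit_trim))
```

A flagged observation is a prompt to look at the row, not an instruction to delete it. If the coefficients barely change after trimming, the flagged points have little practical influence on the conclusions.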

Predictive Data

C.Daniels

2025-04-28
